MEGA User Instructions

PART-I: GETTING STARTED

First Time User

Thank you for choosing to use MEGA in your research. This manual provides comprehensive documentation
for the MEGA software application. New users of MEGA may wish to read and follow along with
our walkthrough tutorial which attempts to touch on every major part of MEGA which you may find
useful. You may also wish to check out the newest features in MEGA.

Quick Start (useful for more technical users)

MEGA User Mode


MEGA can be used with either a graphical user interface (useful for visual exploration of data and results)
or a command-line interface (useful for batch or scripted execution).
The graphical user interface (GUI) is run in one of two modes. The first mode is the Analyze mode in which
all GUI tools in MEGA are enabled and visual results explorers are available for tasks such as editing
sequence alignments and viewing phylogenies. This is the mode that most MEGA users are familiar with.
The second mode is the Prototype mode, which is used solely for generating MEGA Analysis
Options (.mao) files that specify analysis settings when using MEGA from a command shell.
The command-line interface of MEGA is accessed by opening a command shell and executing
the megacc command. The megacc command requires several options, including the path to a .mao file
and paths to the input data file(s) to be analyzed.

Aligning Sequences (using GUI)


1. MEGA supports sequence alignment using both the ClustalW and MUSCLE programs.
2. Alignment (or refinement) is done in the Alignment Explorer (Alignment -> Open Alignment
Explorer from main menu).
3. We can start either with a blank alignment (if we are importing sequences from NCBI, or don’t have
a compatible sequence file) or from a compatible sequence file.
4. With our sequences in the Alignment Explorer (AE), we select Alignment from the menu, then
either ClustalW or Muscle.
5. Set the alignment parameters to the values you wish or leave the options alone to use the defaults.
Click Compute/OK.
6. Depending on the length and number of sequences you may see a progress bar while the alignment
is running.
7. The aligned sequences will replace the previously unaligned sequences in the Alignment Explorer.
You may now export them to MEGA or Fasta format for analysis.

Running an Analysis (using GUI)


(Note: Sequences MUST be aligned before analysis can proceed.)
1. Select the analysis you wish to run from the top toolbar in the main window.
2. You are shown a list of options for this analysis. You can only change the options which are drawn
in a white box. Click Compute.
3. Depending on the length of the analysis you may see a progress bar while the analysis is running.
4. Your output will appear as a tree, a matrix, text, or another result type, depending on the analysis.
5. In most results there will be the option to save your analysis. This usually resides in the File or Data
menus of the results window.
Executing Analyses From a Command Shell
1. When using MEGA's command-line interface, all calculations are launched in the same way.
2. In the MEGA main form, click the Prototype button and then specify the type of input data that will be used
for analysis.
3. Select the analysis you wish to run from the top toolbar in the main window.
4. When the options dialog is displayed, select the desired options or use the default ones. Click Save Settings to
save the options to a .mao file.
5. From a command shell, execute the megacc command with the .mao file and input data file(s) as parameters.
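The last step can be scripted. The following Python sketch builds a megacc invocation and runs it only if the program is found on the PATH. The -a (analysis options file), -d (data file) and -o (output) flags are assumptions based on common MEGA-CC usage, and the file names are hypothetical; check the MEGA-CC help for your installation before relying on them.

```python
import shutil
import subprocess


def build_megacc_command(mao_file, data_file, output_path):
    """Return the argument list for one megacc run.

    Assumption: -a names the .mao settings file, -d the input data file,
    and -o the output location. Verify against your MEGA-CC version.
    """
    return ["megacc", "-a", mao_file, "-d", data_file, "-o", output_path]


def run_megacc(mao_file, data_file, output_path):
    """Run megacc if it is on the PATH; return its exit code, or None if absent."""
    if shutil.which("megacc") is None:
        return None
    command = build_megacc_command(mao_file, data_file, output_path)
    return subprocess.run(command, check=False).returncode


# Hypothetical file names, for illustration only.
print(" ".join(build_megacc_command("nj_settings.mao", "Drosophila_Adh.meg", "results")))
```

Because the .mao file carries all analysis settings, the same wrapper can be reused for any analysis by swapping the settings file.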


What's New In This Version


MEGA 11 contains a number of enhancements over previous versions. They include:
• The RTDT (RelTime with Dated Tips) method for constructing timetrees calibrated using sample
times from molecular data has been updated to support calendar dates in addition to sample years.
Previously, tips could only be dated by year but now they can be dated by year, month, and day.
• The Maximum Likelihood (ML) framework in MEGA has been optimized to use memory more
efficiently so that larger data sets can be successfully analyzed in MEGA.
• All distance calculations have been moved from the main GUI thread to separate threads so that the
GUI remains responsive when working with larger data sets.
• Additional meta-data commands have been added to the .meg sequence alignment and distance
matrix file formats. These meta-data commands allow for dynamic grouping of taxa and are
supported for group-wise analyses in MEGA.
• In the Sequence Data Explorer, sites can now be automatically labelled based on site attributes (e.g.
variable sites, parsimony informative sites, etc.).
• The Sequence Data Explorer has been updated with many minor enhancements such as navigation
by highlighted sites, preservation of codons when exporting highlighted sites, highlighting labelled
sites, selecting sequences based on group attributes, and other additions.
• A feature has been added to the Tree Explorer for navigating ancestral state changes between nodes
for Maximum Likelihood and Maximum Parsimony trees.
• The Tree Explorer formatting toolbar has been completely revamped to be more user-friendly.
• A feature has been added to the Tree Explorer to auto-collapse nodes by group name, by cluster size,
or by branch length difference.

A note about compatibility between different versions of MEGA X


• Session files (*.masx, *.mtsx, *.mdsx) cannot be shared between different platforms. For example, a tree
session file (file.mtsx) created on a Linux or macOS system cannot be opened on a Windows computer.
Additionally, a session file generated on a 32-bit operating system should not be opened on a 64-bit system
and vice-versa.

Analyze and Prototype Modes


The MEGA graphical user interface (GUI) can be run in two different modes depending on what is needed. The first
mode, named Analyze, is the full GUI that users of MEGA are accustomed to. In this mode, all visual tools for analyzing
data and exploring results are enabled. If you want to do things like construct a phylogeny and view the phylogeny in
the Tree Viewer, this is the mode you want to use.
The second mode, named Prototype, is used solely for generating MEGA Analysis Options (.mao) files that are used
with MEGA's command-line interface (MEGA-CC). In this mode, all of the data/results visualization tools are disabled.
Only the analysis menus and options dialogs are enabled. If you want to run MEGA from a command shell, this is the
mode you will use when accessing the GUI. Note to previous MEGA-CC users: the Prototype mode of the new MEGA
GUI replaces the MEGA-Proto application that was previously used with MEGA-CC. See “Executing Analyses From a
Command Shell” for an example of how MEGA is used via a command shell.
To switch between Analyze and Prototype modes, click the appropriate button in the bottom right corner of the main
MEGA window.

If switching to Prototype mode, select the data type that will be used from the dropdown list when prompted.

After that, you can select an analysis to execute as well as options by clicking on the appropriate top toolbar button.

Using MEGA in the Classroom


Because MEGA includes many statistical methods for the study of molecular evolution in an interactive
framework, it is instructive for classroom teaching. If you are interested in using MEGA in the classroom (or
common computing lab), there are no restrictions. Your students may download a copy from the
website www.megasoftware.net or you may install copies on multiple computers in a common computing
area (like a computing lab). However, if you want to use MEGA in any other form, please contact the authors
by e-mail ([email protected]).
If you are using MEGA in classroom teaching, please send us the following information by e-mail for our
records ([email protected]). (1) Your name, position and institution, (2) course number and title, (3)
number of students, and (4) course semester and year.
***Note for Windows systems administrators*** If you need to perform a mass installation of MEGA, the
installer can be run from a batch script or command line using the /SILENT /VERYSILENT flags.
Then MEGA will be installed with default options. For example:
MEGAX_10.1.0_win64_setup.exe /SILENT /VERYSILENT

Technical Support and Updates

All minor (bug fix) and major updates of MEGA will be made available at the
website www.megasoftware.net. You can manually check for a newer version of MEGA by clicking the
“Updates?” button, which is located at the bottom of the MEGA main window.

Reporting Bugs

If you encounter technical problems such as unexplained errors, documentation inconsistencies, or program
crashes, please report them to us by clicking the ‘Report Bug’ link in MEGA’s main window. Please note
that telephone inquiries will not be accepted.
Please include the following information in your report: (1) your name and email address, (2) the version
of MEGA you are working with, (3) the version of the operating system you are working in, (4) a copy of
your data file (if possible), (5) a description of the problem, and (6) the sequence of events that led to that
problem (this is often crucial to understanding and remedying the problem quickly).

Guide to Notations Used

Item                      Convention                   Example

Directory & file names    Small Cap + Bold             INSTALL.TXT
File name extensions      Small Cap + Bold             .TXT, .DOC, .MEG
Email address/URLs        Underlined                   www.megasoftware.net
Pop-up help links         Dotted Underlined + Green    statement
Help Jumps                Underlined + Blue            set of rules
Menu/Screen Items         Italic                       Data Menu
User-Entered Text         Monospace font               !Title

Introduction
This walk-through provides several brief tutorials that explain how to perform common tasks in MEGA. Each
tutorial requires the use of sample data files which can be found in the /MEGA/Examples folder (default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac and
Linux users is $HOME/MEGA/Examples, where $HOME is the user’s home directory). It is recommended
that you follow the examples for a given tutorial in the order presented as the techniques explained in the
initial examples are used again in the subsequent ones.
In the tutorials, the following conventions are used:
• Keystrokes are indicated by bold letters (e.g., F4).
• If two keys must be pressed simultaneously, they are shown with a + sign between them
(e.g., Alt + F3 means that the Alt and F3 keys should be pressed at the same time).
• Italicized words indicate the name of a menu or window.
• Italicized bold words indicate individual commands that are found in menus, submenus, and toolbars.
• ‘Main menu’ refers to the menu bar at the top of the currently active window (File, Analysis,
Help, etc.).
• ‘Main MEGA menu’ refers to the menu on the main window of MEGA where you launch all of the
analyses from.

• ‘Launch bar’ refers to the toolbar located directly below the main menu of the currently active
window (Align, Data, Models, Distance, etc.).

• For brevity, a sequence of menu / button clicks is indicated by a sequence of commands separated by
pipes (e.g., ‘File | Open’ indicates that you should click on the ‘File’ main menu item and then click
on the ‘Open’ sub menu item that is displayed).
I want to learn about:
1. MEGA Basics
2. Aligning Sequences
3. Estimating Evolutionary Distances
4. Building Trees from Sequence Data
5. Testing Tree Reliability
6. Working with Genes and Domains
7. Testing for Selection
8. Managing Taxa with Groups
9. Computing Sequence Statistics
10. Building Trees from Distance Data
11. Constructing Likelihood Trees
12. Editing Data Files
13. Constructing Time Trees
14. Inferring Gene Duplications

MEGA Basics

In this tutorial, we will focus on opening and manipulating data files and saving results. All of the data files
used in this tutorial can be found in the MEGA/Examples/ folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
You can directly access the Examples folder by clicking the ‘Examples’ button on the bottom bar of the
main window.
Active Files vs. Open Files
In order to perform any kind of calculation/analysis in MEGA you will need to provide a data file. If you
are running an analysis on a data file with sequences then you must make sure that the sequences have been
aligned prior to analysis (the sequences must be all the same length). In order to get the sequences ready
for analysis you may have to align them using the Alignment Editor which provides automated and manual
alignment facilities. If you have a file which needs editing in order to conform to one of the file format
standards you can open it up in the Text Editor for manual editing.
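The equal-length requirement is easy to check before loading a file into MEGA. The following Python sketch is not part of MEGA; the function name and the sample sequences are made up for illustration.

```python
def check_aligned(sequences):
    """Report whether all sequences have the same length.

    sequences maps a taxon name to its sequence string.
    Returns (True, common_length) or (False, {name: length}).
    """
    lengths = {name: len(seq) for name, seq in sequences.items()}
    distinct = set(lengths.values())
    if len(distinct) == 1:
        return True, distinct.pop()
    return False, lengths


# Made-up sequences: the second one is one site short, so it still needs aligning.
ok, detail = check_aligned({"taxon_1": "ATGGCA", "taxon_2": "ATGCA"})
print(ok, detail)
```

Equal length is necessary but not sufficient: it shows that gaps have been inserted, not that the alignment is biologically sensible.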

Viewing a Data File Using the MEGA Text Editor


From the main MEGA menu, you can open any text file for viewing and/or editing. In this example, we will
open a native MEGA text file and explore its format. This feature isn’t used too often in MEGA, but is useful
when you have a file which is corrupted or needs manual editing. If you want to start using MEGA ASAP,
you can skip this example.
Example 1.1:
To open the text editor, click File | Edit a Text File from the main MEGA menu.
In the window that opens, select File | Open and use the file browser to navigate to
the MEGA/Examples directory. Select the "Drosophila_Adh.meg" file to open it.
Examine the "Drosophila_Adh.meg" file. Notice the OTU (Operational Taxonomic Unit) names and
the interleaved sequence data. This file is in the MEGA format which is one of the two formats
which MEGA reads for analysis. The other format MEGA accepts for analysis is FASTA.
Note:
• The #mega directive specifies the format of the file, which is the MEGA format.
• The !Title directive indicates the title of the file, up to the semicolon.
• The !Description directive indicates the description of the file, up to the next semicolon.
• The !Gene directive is used to identify the gene and its properties, up to the next semicolon.
• The # directive is used to specify the sequence data.
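To make the roles of these directives concrete, here is a minimal Python sketch that reads a simplified MEGA-format text. It is an illustration under stated assumptions, not MEGA's own parser: it handles only the #mega header, !commands ending in a semicolon, and #name sequence lines, it ignores bracketed comments and other features of the real format, and a repeated command (such as several !Gene entries) overwrites the earlier one. The sample data is invented.

```python
def parse_mega_text(text):
    """Parse a simplified MEGA-format string into (commands, sequences)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0].lower() != "#mega":
        raise ValueError("a MEGA file must start with #mega")
    commands, sequences = {}, {}
    pending = None  # a !command that has not yet reached its semicolon
    for line in lines[1:]:
        if pending is not None or line.startswith("!"):
            pending = line if pending is None else pending + " " + line
            if pending.endswith(";"):
                parts = pending[1:-1].split(None, 1)
                if parts:
                    commands[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
                pending = None
        elif line.startswith("#"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue
            residues = "".join(parts[1].split()) if len(parts) > 1 else ""
            # Interleaved files repeat each name once per block, so append.
            sequences[parts[0]] = sequences.get(parts[0], "") + residues
    return commands, sequences


# Invented example in interleaved layout.
SAMPLE = """#mega
!Title Toy alignment;
!Description A made-up example,
 split over two lines;

#seq_A ATGTTC
#seq_B ATGTTT

#seq_A GGA
#seq_B GGC
"""

commands, sequences = parse_mega_text(SAMPLE)
print(commands["Title"], sequences)
```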
From the text editor, you can make changes to the file such as specifying a format. Experiment with the
menu options in the editor.
Exit the text editor before proceeding with data analysis. Select the File | Exit Editor main menu option
from the text editor window. If the editor asks you if you would like to save the changes that you have made
to the file, click ‘No’.

Opening (activating) a Data File for Analysis


You can activate a data file using any of the following methods:
• From the launch bar of the main MEGA window, click Data | Open a File/Session
• Click File | Open a File/Session from the main menu of the MEGA window.
• Select an analysis or computation from MEGA’s main menu, then you will be prompted to open a
data file. You will have to supply a data file before proceeding with the calculation.
Example 1.2:
Now we will select a file to activate using the first method. From the main MEGA window,
select Data | Open a File/Session from the launch bar. Navigate to the Examples directory
(Mega7/Examples) and open the "Drosophila_Adh.meg" file.
Below the main MEGA launch bar you will notice that two icons appear in the main MEGA window;
a “TA” icon and a “Close Data” icon. Click the “TA” icon and you will be able to view the data you
just opened. Click the “Close Data” icon to close the data file currently opened.
Note: Only one data file may be open at a time. If you open a different data file by going
to Data | Open a File/Session, you will see a warning that asks whether you want to close the current file to
open another; just say “yes”. Each time you select an analysis, MEGA will ask if you would like to use the
currently active data. If you click “yes”, the next analysis will use the data file you already have open.
If you click “no”, the current data file will be closed and you will be asked for a new file.
Hint: You can turn this prompt off by selecting the checkbox “Remember to use currently active data file”. MEGA will then
assume that you want to keep using that file until you open a different one or close MEGA.

Viewing Sequence Data


The Sequence Data Explorer allows you to visually explore your sequence data as well as perform a wide
range of statistical analyses based on data composition. You can activate the Sequence Data
Explorer window by using any of these methods:
• Click the “TA” icon below the launch bar of the main MEGA window.
• Press F4 on the keyboard.
• Select Data | Explore Active Data from the launch bar of the main MEGA window.
Note: You must have a data file already opened to explore active data.
Example 1.3:
Re-open the Drosophila file as described in Example 1.2.
Select Data | Explore Active Data from the launch bar of the main MEGA window and the sequence
data will be displayed in the Sequence Data Explorer window. Leave this window open for the next
example.
Note: If you hover your mouse over the icons on the toolbar, each icon will display text describing the icon’s
function. This window provides several options for saving the displayed data in various formats, translating
and highlighting sequences, setting site coverage, as well as various tools for locating information within
the data file.

Translating Sequences
Using the Sequence Data Explorer, you can translate protein-coding sequences into amino acid sequences
and back using any of the following methods:
• Select Data | Translate Sequences from the Sequence Data Explorer main menu.
• Press the T key on the keyboard.
• Click the button on the Sequence Data Explorer launch bar labeled UUC -> Phe.
Note: The T key is a toggle - it turns the translation on and off. You can tell whether the data is translated
or not by clicking on the Sequence Data Explorer main menu option, Data. There will be a check mark next
to the Translate Sequences option if the data is translated.
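For readers who want to see what translation does to the data, here is a small Python sketch. It assumes the standard genetic code (MEGA lets you choose other code tables), reads the sequence in frame from its first base, and marks codons containing gaps or ambiguous bases with "?". The function and constant names are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a coding sequence into one-letter amino acid codes.

    U is treated as T, a trailing partial codon is dropped, and any codon
    not found in the table (gaps, ambiguity codes) becomes '?'.
    """
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "?") for i in range(0, usable, 3))


print(translate("AUGUUCUAA"))  # UUC codes for Phe (F), as on the toolbar button
```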
Example 1.4:
With the Drosophila file still open in the Sequence Data Explorer (from the previous example), press the
T key on the keyboard to translate the nucleotide sequences into amino acid sequences.
Once the sequences are translated, calculate the amino acid composition by selecting the Statistics |
Amino Acid Composition main menu command from the Sequence Data Explorer window. If you do
not have Microsoft Excel installed, we suggest that you first select Statistics | Display Results in
Comma-delimited (CSV) or Statistics | Display Results in Text Editor, so that the results are shown in CSV
or text format, and then run the Amino Acid Composition report. If you do have Excel, MEGA will open
an Excel workbook displaying the calculations for the amino acid composition, except on a Mac, where
you must save a file instead.
Exit Excel.
Note: If Excel is not installed on your computer and you still choose the Excel format, you will be
prompted to save the results in Excel format somewhere on your hard drive.

Exporting Sequence Data


Using the Sequence Data Explorer, you can save data in the following formats: Mega, Nexus (PAUP 4.0),
Nexus (PAUP 3.0/MacClade), Phylip 3.0, Excel Workbook, or CSV (Excel Importable).
Example 1.5:
On the Sequence Data Explorer launch bar, click on the Export Data icon. The window, Exporting
Sequence Data will appear.
In this window, you can set the title and a description of the data as well as choose a format.
Note: If you choose any of these export formats except Excel, the data will open in the Text File Editor
and Format Converter window; to save the exported text, go to File | Save As. If you choose the
Excel option and Excel is installed on your computer, the data will appear in a new Excel workbook. If you
choose Excel and you do not have Excel installed on your computer, an Excel file will be created and you
will be prompted for a location in which to save it. For this example, select any of the options except Excel.

Saving Sessions
MEGA includes a feature for saving data sessions that allows you to save translation state, highlighting,
font changes, taxa groups, genes and domains, and other changes associated with your current file into a
single session file. If you open the saved session later, the data and all of the associated settings will be
restored automatically.
Example 1.6:
From the Sequence Data Explorer main menu, select Data | Save Session. A ‘Save As’ dialog opens
that will allow you to save the session in an “.mdsx” file at the location of your choice.
Any translation, highlighting, font changes, etc. will be saved in the resulting session. Save the file
as “Drosophila_Adh.mdsx”.
Close the Sequence Data Explorer window and the data file by clicking the Close Data icon in the
main MEGA window.
Reopen the session by selecting Data | Open a File / Session… from the launch bar of the
main MEGA window and selecting the “Drosophila_Adh.mdsx” file. Any changes made to the data
are preserved.
Close the Sequence Data Explorer window and the Drosophila file.

Viewing Distance Data


MEGA allows you to save distance data in MEGA’s native “.meg” format and later explore the data using
the Distance Data Explorer.
Example 1.7:
From the launch bar on the main MEGA window, select Data | Open a File/Session…
In the Open a File window, find the data file named "Distance Data.meg", then click the Open button
to activate the data file. This file is located in the MEGA/Examples directory.
On the main MEGA window, select Data | Explore Active Data. The contents of the selected data file
will be displayed in the Distance Data Explorer window.
In the leftmost column you will see the names of the taxa listed. You can resize this column by
dragging the vertical bar at its right edge. The distances are displayed in the columns to the right.
You can change the number of decimal places displayed in the distances by clicking on the toolbar
icons labeled 0.0 (Decrease Decimal) and 0.00 (Increase Decimal).

Exporting Distance Data


Throughout MEGA, you will find viewing windows, each with its own set of toolbar icons. Wherever
appropriate, you will see a bank of "Export" icons. The set of icons being displayed depends on which
viewer you are using and the current analysis. In the Distance Data Explorer, the available export formats
and associated icons are: XL, CSV, MEGA and TXT.
Example 1.8:
Click on the icon labeled CSV. The Distance Write-out Options window will appear. Because you
clicked on the CSV icon, the Output Format is automatically set to "CSV : Comma-separated file".
Click the Print/Save Matrix button. A new CSV file will open in the Text File Editor and File
Format Converter displaying your data. The new file is automatically given a name and saved to your
computer’s temporary folder. Use the File | Save As menu option in the text editor to give the file a
different name and to specify a destination folder for it.
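If you later need to produce a similar file outside MEGA, a lower-left distance matrix can be written as CSV with a few lines of Python. This is a sketch only: the layout (a header row of taxon names, then one row per taxon) is an assumption and may not match MEGA's CSV output exactly, and the taxon names and distances are invented.

```python
import csv
import io


def distance_matrix_to_csv(names, distances, decimals=3):
    """Render a lower-left distance matrix as CSV text.

    distances[i][j] is the distance between names[i] and names[j] for j < i,
    so row 0 is empty, row 1 has one value, and so on.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + names[:-1])  # the last taxon never heads a column
    for i, name in enumerate(names):
        row = [f"{distances[i][j]:.{decimals}f}" for j in range(i)]
        writer.writerow([name] + row)
    return buffer.getvalue()


# Invented distances for three taxa.
print(distance_matrix_to_csv(["A", "B", "C"], [[], [0.1], [0.2, 0.3]]))
```

The decimals parameter plays the same role as the Increase Decimal and Decrease Decimal toolbar icons: it changes only how the values are displayed.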

Calculating Average Distances


On the Distance Data Explorer task bar, you will find the Average menu. From here, you can calculate
average pair-wise distances between sequences in several ways: Overall, Within Groups, Between
Groups and Net Between Groups. To calculate averages based on groups, groups must first be defined
for your data. For more information on defining groups, see the tutorial “Managing Taxa with
Groups”.
Example 1.9:
Select Average | Overall from the Distance Data Explorer main menu. A dialog is displayed that
shows the calculated overall average pair-wise distance among all selected sequences.
Close the Distance Data Explorer window by selecting the File | Quit Viewer main menu option.
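The four averages can be illustrated with a short Python sketch. It is not MEGA's code: the net between-group value uses the usual definition (the between-group mean minus the average of the two within-group means), and the taxa, groups and distances are invented.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def overall_mean(dist, taxa):
    """Average distance over all pairs of taxa (also used within one group)."""
    return mean(dist[frozenset(pair)] for pair in combinations(taxa, 2))


def between_group_mean(dist, group_a, group_b):
    """Average distance over all pairs with one taxon from each group."""
    return mean(dist[frozenset((a, b))] for a in group_a for b in group_b)


def net_between_group_mean(dist, group_a, group_b):
    """Between-group mean minus the average of the two within-group means."""
    within = (overall_mean(dist, group_a) + overall_mean(dist, group_b)) / 2
    return between_group_mean(dist, group_a, group_b) - within


# Invented pairwise distances, keyed by unordered taxon pairs.
DIST = {
    frozenset(("A1", "A2")): 0.02, frozenset(("B1", "B2")): 0.04,
    frozenset(("A1", "B1")): 0.10, frozenset(("A1", "B2")): 0.12,
    frozenset(("A2", "B1")): 0.11, frozenset(("A2", "B2")): 0.13,
}
GROUP_A, GROUP_B = ["A1", "A2"], ["B1", "B2"]

print(overall_mean(DIST, GROUP_A + GROUP_B))
print(between_group_mean(DIST, GROUP_A, GROUP_B))
print(net_between_group_mean(DIST, GROUP_A, GROUP_B))
```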

Note: If you wish to continue with the tutorial, leave MEGA open. If not, close MEGA by selecting the File
| Exit MEGA menu command from the main MEGA window.
Note: If you close MEGA and then reopen it, MEGA will remember the settings you used previously for an
analysis (bootstrap, model, etc.). If the settings you used last are not applicable to the analysis you are
performing currently, MEGA will select the first available applicable options for you. MEGA tries to reuse
as many settings as it can in order to save time and effort.

Aligning Sequences

In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this tutorial
can be found in the MEGA\Examples\ folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.

Aligning Sequences by ClustalW


You can create a multiple sequence alignment in MEGA using either the ClustalW or Muscle algorithms.
Here we align a set of sequences using the ClustalW option.
Example 2.2:
Open the "hsp20.fas" alignment file (using the instructions above).
Select the Edit | Select All menu command to select all sites for every sequence in the data set.
Select Alignment | Align by ClustalW from the main menu to align the selected sequence data using
the ClustalW algorithm. Click the “Ok” button to accept the default settings for ClustalW.
Once the alignment is complete, save the current alignment session by selecting Data | Save
Session from the main menu. Give the file an appropriate name, such as "hsp20_Test.mas". This will
allow the current alignment session to be restored for future editing.
Exit the Alignment Explorer by selecting Data | Exit Aln Explorer from the main menu.

Aligning Sequences Using Muscle


Here we describe how to create a multiple sequence alignment using the Muscle option.
Example 2.3:
Starting from the main MEGA window, select Align | Edit/Build Alignment from the launch bar.
Select Create a new alignment and then select DNA.
From the Alignment Explorer window, select Data | Open | Retrieve sequences from a file and select
the “Chloroplast_Martin.meg” file from the MEGA/Examples directory.
On the Alignment Explorer main menu, select Edit | Select All.
On the Alignment Explorer launch bar, you will find an icon that looks like a flexing arm. Click on
it and select Align DNA.
Near the bottom of the MUSCLE - AppLink window, you will see a row called Alignment Info. You
can read information about the Muscle program.
Click on the Compute button (accept the default settings). A Progress window will keep you
informed of Muscle alignment status. In this window, you can click on the Command Line Output tab
to see the command-line parameters which were passed to the Muscle program. Note: The analysis
may complete so fast that you won’t be able to click on this tab or read it. The information in this
tab isn’t essential; it’s just interesting.
When the Muscle program has finished, the aligned sequences will be passed back to MEGA and
displayed in the Alignment Explorer window.
Close the Alignment Explorer by selecting Data | Exit Aln Explorer. Select No when asked if you
would like to save the current alignment session to file.
Obtaining Sequence Data from the Internet (GenBank)
Using MEGA’s integrated web browser you can fetch GenBank sequence data from the NCBI website if
you have an active internet connection.
Example 2.4:
From the main MEGA window, select Align | Edit/Build Alignment from the main menu.
When prompted, select Create New Alignment and click OK. Select DNA.
Activate MEGA’s integrated browser by selecting Web | Query Genbank from the main menu.
When the NCBI: Nucleotide site is loaded, enter CFS as a search term into the search box at the top
of the screen. Press the Search button.
When the search results are displayed, check the box next to any item(s) you wish
to import into MEGA.
If you have checked more than one box: locate the Display Settings dropdown (located near
the top left hand side of the page directly under the tab headings). Change the value to FASTA
(Text) and click the Apply button. This will output all the sequences you selected as text in
FASTA format.
Press the Add to Alignment button (with the red + sign) located above the web address bar. This will
import the sequences into the Alignment Explorer.
With the data now displayed in the Alignment Explorer, you can close the Web Browser window.
Align the new data using the steps detailed in the previous examples.
Close the Alignment Explorer window by clicking Data | Exit Aln Explorer. Select No when asked
if you would like to save the current alignment session to file.
Note: We have aligned some sequences and they are now ready to be analyzed. Whenever you need to
edit/change your sequence data, you will need to open it in the Alignment Editor and edit or align it there.
Then export it to the MEGA format and open the resulting file.

Estimating Evolutionary Distances

In this tutorial, we will estimate evolutionary distances for sequences from 11 Drosophila species using
various models. The data files used in this tutorial can be found in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location
for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Estimating Evolutionary Distances Using Pairwise Distance


In MEGA, you can estimate evolutionary distances between sequences by computing the proportion of
nucleotide differences between each pair of sequences.
Example 3.1:
Open the "Drosophila_Adh.meg" data file. If needed, refer to the “MEGA Basics” tutorial.
From the main MEGA launch bar, select Distance | Compute Pairwise Distance.
In the Analysis Preferences window, click the Substitutions Type pull-down and then select
the Nucleotide option.
Click the pull-down for Model/Method and select the p-distance model. For this example we will be
using the defaults for the remaining options. Click Compute to begin the computation.

A progress indicator will appear briefly and then the distance computation results will be displayed
in grid form in a new window. Leave this window open so we can compare the results from the next
steps.
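The p-distance used in Example 3.1 is simply the proportion of compared sites at which two aligned sequences differ. The Python sketch below shows the calculation for one pair. Skipping sites where either sequence has a gap or missing-data symbol is an assumption (it resembles a pairwise-deletion treatment; MEGA's Gaps/Missing Data option controls the actual behavior), and the sequences are invented.

```python
def p_distance(seq_a, seq_b, skip_chars="-?"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue  # assumption: ignore sites with a gap or missing data
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Two invented sequences that differ at 2 of 8 sites.
print(p_distance("ACGTACGT", "ACGTTCGA"))
```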

Compute and Compare Distances Using Other Models/Methods


MEGA supports a wide collection of models for estimating evolutionary distances. Here we compare
evolutionary distances calculated by using different models.
Example 3.2:
Repeat Example 3.1 above, but select the Jukes/Cantor model under the Model/Method pull-down
instead of the p-distance model, leaving all the other options the same. Again, leave the results
window open for comparison.
Repeat the analysis, this time selecting the Tamura-Nei model under the Model/Method pull-down,
leaving all the other options the same. Again, leave the results window open for comparison.
You are now able to compare the three open result windows which contain the distances estimated
by the different methods.
After you have compared the results, select the File | Quit Viewer option for each result window. Do
not close the "Drosophila_Adh.meg" data file.
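When comparing the result windows, it helps to know how a model-based distance relates to the p-distance. The sketch below assumes the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the p-distance. The function name is illustrative, and the Tamura-Nei model is not covered because it needs separate transition and transversion counts and base frequencies.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance computed from a p-distance (proportion of differing sites)."""
    if not 0.0 <= p < 0.75:
        # The correction is undefined once p reaches 3/4 (saturation).
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because the formula corrects for multiple substitutions at the same site, the Jukes-Cantor value is never smaller than the p-distance it was computed from, and the gap widens as sequences become more divergent.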

Compute the Proportion of Amino Acid Differences


You can also calculate evolutionary distances based on the proportion of amino acid differences.
Note: MEGA will automatically translate nucleotide sequences into amino acid sequences using the
selected genetic code table. The genetic code table can be edited by selecting Data | Select Genetic Code
Table from the main MEGA launch bar.
Example 3.3:
From the main MEGA launch bar, select Distance | Compute Pairwise Distances. This will display
the Analysis Preferences window.
Click the Substitutions Type pull-down, select Amino Acid and then select p-distance
under Model/Method.
Click the Compute button to accept the default values for the rest of the options and begin the
computation. A progress dialog box will appear briefly. As with the nucleotide estimation, a results
viewer window will be displayed, showing the distances in a grid format.
After you have inspected the results, use the File | Quit Viewer command to close the results viewer.
Close the data by selecting the Close Data button on the main MEGA task bar.
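The amino acid p-distance in Example 3.3 depends on first translating each codon with a genetic code table. The sketch below assumes the standard genetic code and in-frame sequences that start at codon position 1. The names CODON_TABLE, translate and aa_p_distance are illustrative, and codons containing gaps or ambiguous bases are simply skipped, which is an assumption and not a description of MEGA's behavior.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def aa_p_distance(nuc_a: str, nuc_b: str) -> float:
    """Proportion of amino acid differences between two aligned coding sequences."""
    pairs = [
        (a, b)
        for a, b in zip(translate(nuc_a), translate(nuc_b))
        if a != "X" and b != "X"  # assumption: skip untranslatable codons
    ]
    if not pairs:
        raise ValueError("no comparable codons")
    return sum(a != b for a, b in pairs) / len(pairs)
```

Synonymous nucleotide changes do not alter the translated residue, so the amino acid p-distance for a pair of sequences is often smaller than the nucleotide p-distance for the same pair.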

Building Trees From Sequence Data

In this tutorial, we will illustrate the procedures for building trees and in-memory sequence data editing,
using the commands available in the Data and Phylogeny menus. We will be using the "Crab_rRNA.meg"
file which can be found in the MEGA/Examples directory. This file contains nucleotide sequences for the
large subunit mitochondrial rRNA gene from different crab species (Cunningham et al. 1992). Since the
rRNA gene is transcribed, but not translated, it falls in the category of non-coding genes.
The “Crab_rRNA.meg” file used in this tutorial can be found in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for
Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Building a Neighbor-Joining (NJ) Tree


In this example, we will illustrate the basics of phylogenetic tree re-construction using MEGA and become
familiar with the Tree Explorer window.
Example 4.1:
Activate the "Crab_rRNA.meg" data file. If necessary, refer to Example 1.2 of the “MEGA Basics”
tutorial.
From the main MEGA launch bar, select the Phylogeny | Construct/Test Neighbor-Joining Tree menu
option.
In the Analysis Preferences window select the p-distance option from the Model/Method drop-down.
Click Compute to accept the defaults for the rest of the options and begin the computation. A progress
indicator will appear briefly before the tree displays in the Tree Explorer window.
To select a branch, click on it with the left mouse button. If you click on a branch with the right
mouse button, you will get a small options menu that will let you flip the branch and perform various
other operations on it.
Select a branch and then press the Up, Down, Left, and Right arrow keys to see how the cursor
moves through the tree.
Change the branch style by selecting the View | Tree/Branch Style command from the Tree
Explorer main menu.
Select the View | Topology Only command from the Tree Explorer main menu to display the
branching pattern on the screen.
You can display the numerical branch lengths in the Topology Only option by selecting View |
Options and clicking on the Branch tab. Check the box labeled Display Branch Length and
click Ok.
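For readers curious about what happens between clicking Compute and seeing the tree, the sketch below shows one step of the textbook neighbor-joining procedure: from a distance matrix d over n taxa, compute Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and join the pair with the smallest Q. This is a sketch of the general method, not of MEGA's implementation. The function names and the small distance matrix are made up for illustration and are not taken from "Crab_rRNA.meg".

```python
def q_matrix(d: list[list[float]]) -> list[list[float]]:
    """Neighbor-joining Q criterion for a symmetric distance matrix."""
    n = len(d)
    totals = [sum(row) for row in d]  # total distance from each taxon to all others
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = (n - 2) * d[i][j] - totals[i] - totals[j]
    return q


def pair_to_join(d: list[list[float]]) -> tuple[int, int]:
    """Indices (i, j), i < j, of the pair with the smallest Q value."""
    q = q_matrix(d)
    n = len(d)
    candidates = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(candidates, key=lambda ij: q[ij[0]][ij[1]])


# Hypothetical distances among five taxa (illustration only).
EXAMPLE_DISTANCES = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
```

A full implementation would replace the joined pair with a new internal node, recompute distances to that node, and repeat until the tree is resolved.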

Printing the NJ Tree (For Windows users)


Windows users can print directly from Tree Explorer.
Example 4.2a:
Select the File | Print option from the Tree Explorer main menu to bring up a standard Print window.
This will print the tree full-sized and may take multiple sheets of paper. Press Cancel.
To restrict the size of the printed tree to a single sheet of paper, choose the File | Print in a
Sheet command from the Tree Explorer main menu. Press Ok.
Select the File | Exit Tree Explorer command to exit the Tree Explorer. Click the OK button to close
the Tree Explorer without saving the tree session.

Printing the NJ Tree (For Mac users)


MEGA does not support printing directly from Tree Explorer when running on a Mac system. To print a
tree using a Mac, users can save the tree image to a PDF file and then print it by normal means.
Example 4.2b:
Select the Image | Save as PDF File option from the Tree Explorer main menu to bring up a
standard Save window. Save the image to the desired location.
Once the document is saved, you can open it with your PDF reader and print the document in the
same manner as any other PDF document.
Select the File | Exit Tree Explorer command to exit the Tree Explorer. Click the OK button to close
the Tree Explorer without saving the tree session.

Construct a Maximum Parsimony (MP) Tree Using the Branch-&-Bound Search Option
Using MEGA, you can re-construct a phylogeny using Maximum Likelihood, Minimum Evolution,
UPGMA, and Maximum Parsimony methods in addition to Neighbor-Joining. Here we re-construct the
phylogeny for the “Crab_rRNA.meg” data using the Maximum Parsimony (MP) method.
Example 4.3:
Select the Phylogeny | Construct/Test Maximum Parsimony Tree(s) menu option from the main
MEGA launch bar. In the Analysis Preferences window, choose Max-mini Branch-&-bound for
the MP Search Method option.
Click the Compute button to accept the defaults for the other options and begin the calculation. A
progress window will appear briefly, and the tree will be displayed in Tree Explorer.
(Windows users) Now print this tree by selecting either of the Print options from the Tree
Explorer's File menu.
(Mac users) Save the tree to a PDF file as described in Example 4.2b above.
Compare the NJ and MP trees. For this data set, the branching pattern of these two trees is identical.
Select the File | Exit Tree Explorer command to exit the Tree Explorer. Click OK to close Tree
Explorer without saving the tree session.

Constructing an MP Tree using the Heuristic Search


For each method of phylogenetic inference, MEGA provides numerous options. In this example, we conduct
MP analysis using the Min-Mini Heuristic search.
Example 4.4:
Follow the steps in Example 4.3 and instead of choosing Max-mini Branch-&-bound, choose
Min-Mini Heuristic for MP Search Method. Change the MP Search Level to 2 and click Compute.
Note: In this example, the Min-Mini Heuristic option yields the same tree as the Max-mini
Branch-&-bound option as long as the MP Search Level is set to 2. However, the computational time is
much shorter for the Heuristic method.
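Both search options compare candidate trees by the number of changes each tree requires. As a hedged illustration of that scoring step only, the sketch below uses Fitch's algorithm for a single site on a rooted binary tree written as nested 2-tuples of taxon names. The tree encoding and function names are assumptions made for this example, and nothing here describes how MEGA scores or searches trees internally.

```python
def fitch_changes(tree, states: dict[str, str]) -> int:
    """Minimum number of state changes needed at one site (Fitch's algorithm)."""
    changes = 0

    def visit(node) -> set[str]:
        nonlocal changes
        if isinstance(node, str):  # leaf: the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:
            return shared
        changes += 1  # no shared state: one change is required at this node
        return left | right

    visit(tree)
    return changes


def tree_length(tree, alignment: dict[str, str]) -> int:
    """Sum of per-site Fitch changes over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )
```

A branch-and-bound search guarantees that no shorter tree exists, while a heuristic search evaluates far fewer candidate trees, which is why it runs faster.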

Examining Data Editing Features


For noncoding sequence data, OTUs (Operational Taxonomic Units) as well as sites can be selected for
analysis.
Example 4.5:
From the main MEGA window select the Data | Select Taxa and Groups option from the launch bar.
A dialog box is displayed.
All the OTU labels are checked in the left panel. This indicates that all OTUs are included in the
current active data subset. To remove the first OTU from the data, uncheck the checkbox next to
the first OTU name in the left panel. Click the Close button.
Now, when you construct a neighbor-joining tree from this data set, it will contain 12 OTUs instead
of 13. Close out of the Tree Explorer window by selecting File | Exit Tree Explorer and do not save.
Deactivate the operational data set by selecting the Close Data icon from the main MEGA window.

Testing Tree Reliability

In this example, we will conduct two different tests of reliability using protein-coding genes from the
chloroplast genomes of nine different species.

The data file “Chloroplast_Martin.meg” which is used in this tutorial can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Bootstrap Testing for a Neighbor-Joining Tree


Example 5.1:
Activate the "Chloroplast_Martin.meg" file. If necessary, refer to Example 1.2 of “MEGA Basics”.
On the main MEGA window task bar, select the Phylogeny | Construct/Test Neighbor-Joining
Tree option.
The Analysis Preferences window appears on the screen. For the Model/Method, select p-distance.
Select Bootstrap method for the Test of Phylogeny option.
Click Compute to accept the default values for the rest of the options. A progress indicator provides
the progress of the test as well as the details of your analysis preferences.
Once the computation is complete, the Tree Explorer appears and displays two tree tabs. The first tab
is the original tree and the second is the Bootstrap consensus tree.
To produce a condensed tree, use the Compute | Condensed Tree main menu command from
the Tree Explorer window. You can further manipulate the appearance of the condensed tree here.
To change the cutoff value, select the View | Options menu command and click the Cutoff tab. For
now, keep the Cut-off value at 50% and click the OK button.
This tree shows all the branches that are supported at the default cutoff value of BCL ≥ 50. Select
the Compute | Condensed Tree main menu command and the original NJ tree will reappear.
From the Tree Explorer window, select the Image | Save as PDF File option and save a PDF image
of the tree to a convenient location.
From the Tree Explorer window, select the File | Exit Tree Explorer command to exit the Tree
Explorer. A warning box will inform you that your tree data has not been saved. Click Ok to
close Tree Explorer without saving the tree.
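The bootstrap test in Example 5.1 rests on resampling alignment columns with replacement, rebuilding a tree from each resampled alignment, and counting how often each grouping reappears. The sketch below shows only the resampling and the percentage calculation. The function names are illustrative, and the tree-building step is left out because it depends on the chosen method.

```python
import random


def bootstrap_replicate(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping rows aligned."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def support_percent(clade_found: list[bool]) -> float:
    """Percentage of replicate trees in which a given grouping was recovered."""
    return 100.0 * sum(clade_found) / len(clade_found)
```

With the cutoff left at 50%, a grouping recovered in fewer than half of the replicates would be collapsed in the condensed tree.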

Interior-branch testing for the Neighbor-Joining Tree


For neighbor-joining trees, you may conduct the standard error test for every interior branch by using
the Interior branch test of phylogeny.
Example 5.2:
From the main MEGA window, select Phylogeny | Construct/Test Neighbor-Joining Tree from the
launch bar.
In the Analysis Preferences dialog, make sure the Substitutions Type option is set to Amino Acid and
the Model/Method is set to p-distance. Set the Test of Phylogeny option to Interior-branch test.
Click Compute to begin the computation. A progress indicator window will appear briefly. When the
tree appears, confidence probabilities (CP) from the standard error test of branch lengths
are displayed on the screen.
Compare the CP values on this tree with the BCL values of the tree that you saved as a PDF file in
the previous exercise.
Now close the Tree Explorer by selecting File | Exit Tree Explorer from the main menu. Close the
current data by clicking the Close Data icon on the main MEGA window.

Working With Genes and Domains

Defining and Editing Gene and Domain Definitions
In this example we will demonstrate how to specify coding and non-coding regions of a sequence. We will
be using the file “Contigs.meg” which is located in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location
for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Example 6.1:
Activate the data file "Contigs.meg". If necessary, refer to Example 1.2 of the “MEGA Basics”
tutorial.
From the main MEGA window launch bar, select Data | Select Genes and Domains.
Notice the column header bar across the top (‘Name’, ‘From’, ‘To’, ‘#Sites’,
‘Coding?’, ‘Codon Start’). Domains will be listed under the column header labeled ‘Name’. Click on
the domain labeled Data underneath the Genes/Domains group, then click on the button
labeled Delete/Edit. Select Delete Gene/Domain to delete the Data domain.
Click on the Genes/Domains label and then click the Add Domain button. Select Add New
Domain from the popup menu.
Right-click on the new domain and select Edit Name from the popup menu. Change the name to
“Exon1” and press the Enter key.
Select the ellipsis (…) button next to the first question mark in the ‘From’ column to set the first site
of the domain. When the Start site for Exon1 window appears, select site number 1 for
the AC087512 chimp row and push the OK button.
Select the ellipsis (…) button in the ‘To’ column to set the last site of the domain. When the End site
for Exon1 window appears, select site number 3918 for the AC087512 chimp row and push
the OK button.
Check the box in the ‘Coding?’ column to indicate that this domain is protein coding. You will need
to click the box three times before the check mark appears.
Add two more domains to the Genes/Domains item using the same steps. One of these domains will
be named “Intron1” and will begin at site 3919 and end at site 5191. The other will be named
“Exon2” and will begin at site 5192 and end at site 8421. Be sure to check the checkbox in
the ‘Coding?’ column for Exon2 to indicate a protein-coding domain.
Click on the Genes/Domains item to highlight it and then click the Add Gene button at the bottom of
the screen. From the popup menu choose Add new gene at the end. Right click on this new gene and
change the name to “Predicted Gene”. Click and drag all of the newly
created domains to the Predicted Gene so that they now appear under the new gene.
Press the Close button at the bottom of the window to exit the Gene/Domain Organization window.

Using Domain Definitions to Compute Pairwise Distances


Now, if we compute pairwise distances between our sequences, the non-coding regions that we specified
in the example above will be ignored.
Example 6.2:
From the main MEGA window, select the Distance | Compute Pairwise Distances option from the
launch bar.
In the Analysis Preferences window, click on the Substitutions Type drop-down and
select Nucleotide. The Select Codon Positions row is now enabled. Make sure that the Noncoding
sites option does not have a checkmark next to it. Click the Compute button to begin the analysis.
When the computation is complete, the Pairwise Distances window will display the pairwise distances
computed using only the sequence data from exonic domains of the Predicted Gene. Close
the Pairwise Distances window by selecting File | Quit Viewer and the Sequence Data
Explorer window by selecting the Close Data icon on the main MEGA window.
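The effect of the domain definitions from Example 6.1 can be pictured as a filter over alignment columns. The sketch below encodes the three domains with the site ranges given above and keeps only the sites in domains marked as coding. The Domain class and the function name are illustrative and are not MEGA data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # first site, 1-based and inclusive
    end: int    # last site, 1-based and inclusive
    coding: bool

    @property
    def n_sites(self) -> int:
        return self.end - self.start + 1


# Site ranges as entered in Example 6.1.
DOMAINS = [
    Domain("Exon1", 1, 3918, coding=True),
    Domain("Intron1", 3919, 5191, coding=False),
    Domain("Exon2", 5192, 8421, coding=True),
]


def coding_sites(sequence: str, domains: list[Domain]) -> str:
    """Concatenate the parts of a sequence that fall in coding domains."""
    return "".join(sequence[d.start - 1:d.end] for d in domains if d.coding)
```

With these ranges, 3918 + 3230 = 7148 of the 8421 sites belong to the two exons, and the 1273 sites of Intron1 are excluded when Noncoding sites is unchecked.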

Testing for Selection

In this example, we describe how to perform a codon-based test of positive selection for five alleles from
the human HLA-A locus (Nei and Hughes 1991).
The “HLA-3Seq.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Computing Synonymous and Non-synonymous Distances


Example 7.1:
Activate the "HLA-3Seq.meg" file. If necessary, refer to Example 1.2 in the “MEGA Basics” tutorial.
From the main MEGA window launch bar, select Selection | Codon-based Z-Test of Selection.
An Analysis Preferences window appears. For the Model/Method, select the Nei-Gojobori method
(Proportion) model.
In the Test Hypothesis (HA: alternative) row, select Positive Selection (HA: dN > dS) from the pull-
down menu.
From the Scope row, select the Overall Average option.
For the Gaps/Missing Data Treatment option, select Pairwise Deletion.
Click on "Compute" to accept the default values for the remaining options. A progress indicator
appears briefly, and then the computation results are displayed in a results window in grid format.
The column labeled "Prob" contains the computed probability (it must be less than 0.05 to reject the
null hypothesis, dN = dS, at the 5% level). The column labeled "Stat" contains the statistic used to
compute the probability. The difference between synonymous and non-synonymous substitutions should
be significant at the 5% level.
Close the Test of Positive Selection window.
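To make the relationship between the "Stat" and "Prob" columns concrete, the sketch below assumes the usual form of a codon-based Z-test: Z = (dN - dS) / sqrt(Var(dN) + Var(dS)), with a one-sided probability taken from the standard normal distribution for the alternative dN > dS. The function name is illustrative, the variances are taken as given inputs, and how MEGA estimates them is not covered here.

```python
import math


def z_test_positive_selection(d_n: float, d_s: float,
                              var_n: float, var_s: float) -> tuple[float, float]:
    """Return (Z statistic, one-sided probability) for the alternative dN > dS."""
    z = (d_n - d_s) / math.sqrt(var_n + var_s)
    # Upper-tail probability of the standard normal distribution.
    prob = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, prob
```

Under this sketch, a large positive statistic gives a small probability, and the null hypothesis is rejected at the 5% level when that probability falls below 0.05.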

Managing Taxa With Groups

The “Crab_rRNA.meg” file, which is used in this tutorial, can be found in the MEGA/Examples folder (The
default location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default
location for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Defining and Editing Groups of Taxa
In MEGA, you can partition data into distinct groups and then evaluate distances within groups, distances
between groups, and the net distance between groups.
Example 8.1:
From the main MEGA window, activate the data present in the "Crab_rRNA.meg" file. If necessary,
refer to Example 1.2 in the “MEGA Basics” tutorial.
From the main MEGA window launch bar, select Data | Select Taxa and Groups. Notice the left pane
called Taxa/Groups and the right pane labeled Ungrouped Taxa.
Press the New Group button found below the Taxa/Groups pane to add a new group to the data. Name
this new group “Pagurus” and press Enter.
While holding the Ctrl button on the keyboard, click on all of the items in the Ungrouped
Taxa pane that begin with Pagurus. This will highlight them. When they are all highlighted, press the
left-facing arrow button found on the vertical toolbar between the two panels (make sure
the Pagurus group on the left side is also highlighted; otherwise the arrow will not appear).
Select the All group in the Taxa/Groups panel and press the + (add) button found on the vertical
toolbar between the two window panes to add a second group. Name this group "Non-Pagurus".
Add the remaining unassigned taxa to this group by using the left arrow and press the Close button
at the bottom of the window to exit this view.
Note: Now that groups have been defined, the Compute Within Group Mean, Compute Between Group
Means, and Compute Net Between Group Means menu commands from the Distance option on the launch
bar may be used to analyze the data.
Close all of the open windows.
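The three group-level quantities mentioned in the note can be sketched from a table of pairwise distances. The code below assumes that the within-group mean averages all pairs inside one group, the between-group mean averages all pairs drawn from two different groups, and the net between-group mean subtracts the average of the two within-group means from the between-group mean. The function names, taxon labels and distances are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

Distances = dict[frozenset, float]  # {frozenset({taxon_a, taxon_b}): distance}


def within_group_mean(group: list[str], dist: Distances) -> float:
    """Mean distance over all pairs inside one group (needs at least 2 taxa)."""
    return mean(dist[frozenset(pair)] for pair in combinations(group, 2))


def between_group_mean(group_a: list[str], group_b: list[str], dist: Distances) -> float:
    """Mean distance over all pairs with one taxon from each group."""
    return mean(dist[frozenset(pair)] for pair in product(group_a, group_b))


def net_between_group_mean(group_a: list[str], group_b: list[str], dist: Distances) -> float:
    """Between-group mean minus the average of the two within-group means."""
    within = (within_group_mean(group_a, dist) + within_group_mean(group_b, dist)) / 2
    return between_group_mean(group_a, group_b, dist) - within


# Hypothetical labels and distances (illustration only).
PAGURUS = ["pagurus_1", "pagurus_2"]
NON_PAGURUS = ["other_1", "other_2"]
EXAMPLE = {
    frozenset(("pagurus_1", "pagurus_2")): 0.02,
    frozenset(("other_1", "other_2")): 0.04,
    frozenset(("pagurus_1", "other_1")): 0.10,
    frozenset(("pagurus_1", "other_2")): 0.12,
    frozenset(("pagurus_2", "other_1")): 0.11,
    frozenset(("pagurus_2", "other_2")): 0.13,
}
```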

Computing Sequence Statistics

The “Drosophila_Adh.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Using Sequence Data Explorer


The Sequence Data Explorer provides various tools for visually analyzing sequence data as well as
calculating compositional statistics. In the following examples we will demonstrate the basic usage of
the Sequence Data Explorer.
Example 9.1:
Activate the "Drosophila_Adh.meg" file. If necessary, refer to Example 1.2 in the “MEGA Basics”
tutorial.
Select the Data | Explore Active Data (F4) command.
Use the arrow keys on your keyboard or the mouse to move from site to site. At the bottom left corner
of the window, you will find an indicator that displays the column and the total number of sites. As
you move through the columns, the column indicator changes.

Highlighting
If you look at the bottom of the Sequence Data Explorer window, the Highlighted Sites indicator displays
"None" because no special site attributes are yet highlighted.
You can highlight variable sites in various ways:
• Select the Highlight | Variable Sites main menu option on the Sequence Data Explorer main screen.
• Click the icon labeled V from the launch bar.
• Press the V key on the keyboard.
Example 9.2:
Use one of the above methods to highlight variable sites in the Drosophila data. All sites that are
variable are now highlighted. The Highlighted indicator at the bottom of the window has been
replaced with the Variable indicator. The number of sites which are variable is displayed, along with
the total number of sites (Variable sites/Total # of sites). When you press the V key again, the sites
return to the normal color. The Highlighted indicator again displays "None".
Now highlight the parsimony-informative sites by pressing the P key, clicking on the button
labeled Pi from the shortcut bar below the main menu, or selecting the Highlight | Parsim-Info
sites menu option. The Highlighted indicator turns into the Parsim-info indicator.
To highlight 0, 2, and 4-fold degenerate sites, press the 0, 2, or 4 keys, respectively, or click on the
corresponding buttons from the shortcut bar below the main menu, or select the corresponding
command from the Highlight menu. Once again, the Highlighted indicator will turn into the
Zero-fold indicator, Two-fold indicator, and Four-fold indicator, respectively.
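The counts shown by the Variable and Parsim-info indicators follow from simple per-column rules. The sketch below assumes the usual definitions: a site is variable if it contains at least two different states, and parsimony-informative if at least two different states each occur in at least two sequences. The function names are illustrative, and gaps and ambiguous characters are treated like any other state, which is a simplification.

```python
from collections import Counter


def is_variable(column: tuple[str, ...]) -> bool:
    """True if the site contains at least two different states."""
    return len(set(column)) > 1


def is_parsimony_informative(column: tuple[str, ...]) -> bool:
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def site_summary(sequences: list[str]) -> tuple[int, int, int]:
    """Return (variable sites, parsimony-informative sites, total sites)."""
    columns = list(zip(*sequences))
    variable = sum(is_variable(c) for c in columns)
    informative = sum(is_parsimony_informative(c) for c in columns)
    return variable, informative, len(columns)
```

Every parsimony-informative site is also variable, so the second count can never exceed the first.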

Statistics
The Statistics main menu option allows you to calculate Nucleotide Composition, Nucleotide
Pair Frequencies and Codon Usage. Before selecting one of these options, you will need to select whether
to use all sites or only the highlighted sites. You will also need to select the format in which you want the
results displayed.
Example 9.3:
Select Statistics | Use All Selected Sites. To display the results of the calculation in a text file using
the built-in text editor, click the Statistics menu option again and select the Display Results in Text
Editor option. To calculate the nucleotide base frequencies, select the option, Nucleotide
Composition, from the Statistics menu.
To compute codon usage, go back to the Sequence Data Explorer and select the Statistics
| Codon Usage menu command. This will calculate the codon usage and display the results of the
calculation in a text file using the built-in text editor.
To compute nucleotide pair frequencies, select the Statistics | Nucleotide Pair Frequencies |
Directional (16 pairs), or the Statistics | Nucleotide Pair Frequencies | Undirectional (10
pairs) main menu option. This will calculate the pair frequencies and display the results of the
calculation in a text file using the built-in text editor.
Note: The Amino Acid Composition option on the Statistics menu is disabled (grayed-out). This
option is only available if the sequences have been translated.
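The statistics in Example 9.3 are straightforward tallies. The sketch below shows nucleotide composition as percentages and the two kinds of pair frequencies for one pair of aligned sequences: directional pairs keep the order of the two sequences (up to 4 x 4 = 16 combinations), while undirectional pairs ignore it (up to 10 combinations). The function names are illustrative, only A, C, G and T are counted, and the layout of MEGA's text report is not reproduced.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def nucleotide_composition(sequences: list[str]) -> dict[str, float]:
    """Percentage of each nucleotide across all sequences."""
    counts = Counter(b for s in sequences for b in s.upper() if b in NUCLEOTIDES)
    total = sum(counts.values())
    return {b: 100.0 * counts[b] / total for b in NUCLEOTIDES}


def directional_pairs(seq_a: str, seq_b: str) -> Counter:
    """Counts of ordered pairs (base in seq_a, base in seq_b)."""
    return Counter(
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in NUCLEOTIDES and b in NUCLEOTIDES
    )


def undirectional_pairs(seq_a: str, seq_b: str) -> Counter:
    """Counts of unordered pairs, so (A, T) and (T, A) are tallied together."""
    merged = Counter()
    for (a, b), n in directional_pairs(seq_a, seq_b).items():
        merged[tuple(sorted((a, b)))] += n
    return merged
```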

Using the Amino Acid Composition Option


Example 9.4:
To translate these protein-coding sequences into amino acid sequences and back again,
select the Data | Translate Sequences main menu command from the Sequence Data
Explorer window.
Once the sequences are translated, calculate the amino acid composition by selecting the Statistics |
Amino Acid Composition main menu command from the Sequence Data Explorer window.
Close the Text File Editor and Format Converter window without saving your work. Close
the Sequence Data Explorer and select the Close Data icon on the main MEGA window.

Building Trees With Distance Data

This tutorial illustrates procedures for building phylogenetic trees using distance data.
The “Hum_Dist.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Making a Phylogenetic Tree from Distance Data


Example 10.1:
Activate the "Hum_Dist.meg" file. If necessary, refer to Example 1.2 in the “MEGA Basics” tutorial.

From the main MEGA window, select Phylogeny | Construct/Test Neighbor-Joining Tree from the
launch bar.
The Analysis Preferences window will appear. For distance data files, none of the options shown here
can be changed. Click on the button labeled Compute. A progress meter will appear briefly.
The Tree Explorer will display a neighbor-joining (NJ) tree on the screen when the analysis
completes.
From the Tree Explorer launch bar, click on the i icon. The number of tabs shown here depends on
the type of tree that was constructed. For a Neighbor-Joining tree, the tabs
are General, Tree and Branch. Take a look at each to see the information they contain.
Saving your Results
MEGA allows you to save trees in MEGA’s native format or in the Newick format.
Example 10.2:
From the Tree Explorer window, select File | Save Current Session. In the Save As dialog, use
the Save in drop-down menu to select the location, and then type in a name for the session in the File
Name area. The tree will be saved with the MEGA ".mts" extension.
Now, from the Tree Explorer window, select File | Export Current Tree from the main menu. In
the Save As dialog, use the Save in drop-down to select the location. In the File Name area, type a
name for the session. The tree will be saved in Newick format with the ".nwk" extension.
Go to the File menu and click on the Exit Tree Explorer option.
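Unlike the ".mts" session file, a Newick file is plain text that other programs can read: nested parentheses describe the branching pattern, and an optional number after a colon gives a branch length. The sketch below writes such a string from a small nested structure. The tree encoding, the function name and the taxon names are assumptions made for illustration, and names containing spaces or punctuation would need quoting that is not handled here.

```python
def to_newick(tree) -> str:
    """Render a tree as a Newick string.

    A leaf is a taxon name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    def render(node) -> str:
        if isinstance(node, str):
            return node
        parts = (f"{render(child)}:{length:g}" for child, length in node)
        return "(" + ",".join(parts) + ")"

    return render(tree) + ";"


# Hypothetical three-taxon tree (illustration only).
EXAMPLE_TREE = [
    ("taxon_a", 0.1),
    ([("taxon_b", 0.05), ("taxon_c", 0.05)], 0.02),
]
```

Writing the returned string to a file with the ".nwk" extension gives a file of the same general kind as the one produced by the export command.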

Constructing Likelihood Trees

MEGA provides options for performing various calculations relating to likelihood. In this tutorial, we will
focus on the one you'll probably use most often, constructing Maximum Likelihood trees.
The “Drosophila_Adh.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Constructing your Tree


Example 11.1:
Activate the "Drosophila_Adh.meg" file. If necessary, refer to Example 1.2 of the “MEGA Basics”
tutorial.
Select Phylogeny | Construct/Test Maximum Likelihood Tree option from the main MEGA window
launch bar.
The Analysis Preferences window will appear. For the Drosophila data file, you can choose
between Nucleotide and Amino Acid substitution types. Select Amino Acid. Now, click on the
drop-down for Models/Methods. Note the models available. Notice that the option
to Select Codon Positions is disabled for Amino Acid sequences.
Change the Substitution Type to Nucleotide. The list of Models/Methods changes, showing only
models which are applicable to nucleotide sequences. Select the Tamura-Nei model. Note that the
option to Select Codon Positions is now available. Click on the button labeled Compute. A progress
indicator will appear briefly.
The Tree Explorer will display the resulting Maximum Likelihood tree on the screen.
From the Tree Explorer toolbar, click on the i icon. The number of tabs shown here depends on the
type of tree that was constructed. For a Maximum Likelihood tree, the tabs
are General, Tree, Branch and Character States. Take a look at each to see the information they
contain.

Saving your Tree


MEGA allows you to save trees in MEGA’s native format or in the Newick format.

Example 11.2:
From the Tree Explorer window, select File | Save Current Session from the main menu. In the Save
As dialog, use the Save in drop-down to select the location then type in a name for the session in
the File Name area. The tree will be saved with the MEGA ".mts" extension.
From the Tree Explorer window, select File | Export Current Tree from the main menu. In the Save
As dialog, use the Save in drop-down to select the location then type in a name for the session in
the File Name area. The tree will be saved in Newick “.nwk” format.
From the Tree Explorer window, select File | Exit Tree Explorer from the main menu. Click
the Ok button without saving.

Editing Data Files

There may be times when you want to make changes to a data file. With the MEGA Alignment Explorer,
you can rearrange the taxa, delete blocks of taxa or delete blocks of sites. The altered data file can then be
saved in either MEGA or FASTA format.
The “Chloroplast_Martin.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).

Using Alignment Explorer


Example 12.1:
From the main MEGA window, select Align | Edit/Build Alignment. Select Create new alignment
| DNA. Then click Data | Retrieve sequences from a file and press the Ok button.
In the Open window, find and select the "Chloroplast_Martin.meg" file.

Rearranging Data
Example 12.2:
In the Alignment Explorer window, click the row header for the row named Pinus. Hold the left
mouse button down and drag the row up, then release the mouse button when the position indicator
is just below the Porphyra row.

Deleting rows
Example 12.3:
Now, click the mouse to highlight Porphyra. Select Edit | Delete on the main menu of the Alignment
Explorer. Do the same for the row Pinus.

Deleting sites
Example 12.4:
Click on the horizontal scroll bar at the bottom of the Alignment Explorer window. Drag it all the
way to the right. Now click on any cell in the last column. Notice that the Site # display changes to
show the highest-numbered site, 11039.
You can delete blocks of sites in the same way that you can delete rows of data. Click on the gray
header above any column of sites, hold down the left mouse button and drag across to any other
column header to select multiple columns. On the toolbar, click the X icon to delete the selected sites.

Save the altered data file


Example 12.5:
On the Alignment Explorer menu, click on Data, and then select Export Alignment. Choose
MEGA format, FASTA format, or PAUP format. In the Save As window, select the folder
in which you want to save your data file and then type a name in the File Name area. Click
the Save button.
Close the Alignment Explorer and click Ok without saving.
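The edits made in Examples 12.3 to 12.5 amount to dropping rows, dropping a block of columns, and writing out what remains. The sketch below mirrors those three operations on an alignment held as a dictionary of name to sequence and writes FASTA text, in which each record has a ">" header line followed by sequence lines. The function names and the 60-character line width are choices made for this example, not MEGA settings.

```python
def delete_taxa(alignment: dict[str, str], names: set[str]) -> dict[str, str]:
    """Remove whole rows (taxa) from an alignment."""
    return {name: seq for name, seq in alignment.items() if name not in names}


def delete_sites(alignment: dict[str, str], first: int, last: int) -> dict[str, str]:
    """Remove the block of sites first..last (1-based, inclusive) from every row."""
    return {name: seq[:first - 1] + seq[last:] for name, seq in alignment.items()}


def to_fasta(alignment: dict[str, str], width: int = 60) -> str:
    """Format an alignment as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in alignment.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

For instance, delete_taxa(alignment, {"Porphyra", "Pinus"}) followed by to_fasta would reproduce the row deletions from Example 12.3 on an alignment loaded by other means.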

Constructing a Timetree (ML)

This example shows how to generate a timetree in MEGA. For this analysis, MEGA uses a Timetree Wizard
window which will walk you through the necessary steps. The data files used in this example can be found
in the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select Clocks | Compute Time Tree | RelTime-ML. The Timetree Wizard
window, which outlines the 6 steps for creating a timetree in MEGA, will be displayed.
Step 1: First, we will load a sequence alignment file. In the Timetree Wizard window, click
the Browse... button and then using the file open dialog, find and select the “mtCDNA.meg” sequence
alignment file. After the alignment file is parsed by MEGA, the Load Tree File action in step 2 will become
enabled.
Step 2: Second, we will load the Newick tree file which gives the topology for our timetree. Click
the Browse… button and using the file open dialog that is displayed, find and select the “mtCDNA.nwk”
tree file. After this file is parsed and validated against the sequence alignment being used, step 3 will become
enabled.
Step 3: Next, we need to specify an outgroup taxon (we will specify one but multiple taxa can be in the
outgroup). Click the Select Taxa… button and the Taxa/Groups window will be displayed with all taxa in
our data listed in the Ungrouped Taxa list box (alternatively you can click the Select Branch… button and
use the Tree Explorer to specify the outgroup). Select the gibbon taxon and move it from
the Ungrouped Taxa list box to the Taxa in Outgroup list box by clicking the left-pointing arrow. Click
the Close button to save your changes and exit the Taxa/Groups dialog.
Step 4: Now, an option to specify divergence time calibration constraints will become available (if this
step is skipped, then only relative times of divergence will be calculated). Click the Add
Constraints… button. MEGA will display the Calibration Editor window that is used for specifying
divergence time constraints in the timetree.
First, we will create a divergence time calibration constraint by specifying two taxa whose most recent
common ancestor is the node for which the time constraint applies. In the Calibration Editor window,
select the Calibration | Calibrate MRCA menu item (or click the add new constraint button on the upper
left toolbar [it looks like a clock with a plus sign on the bottom right]). This will create a new calibration
constraint with a default name. From the Taxon A and Taxon B dropdown lists select chimpanzee and
bonobo. The Calibration Name edit box and the MRCA Node Label edit box are populated with default
names but you can edit these if you like. The MRCA node label is especially useful for interpreting the
tabular Timetree output produced by MEGA’s Timetree system so that you can quickly identify calibrated
nodes by name instead of by node number. In the Min Divergence Time edit box enter 1.2. In the Max
Divergence Time edit box enter 5.0.
Next, we will create another calibration constraint by selecting a node in the tree display. In the tree display,
select the node whose descendants are orangutan and sumatran (click the node to select it; a red diamond will
appear around it once it is selected). Select the Calibration | Calibrate Selected Node menu item (or
on the upper-right toolbar, click the new divergence time constraint button [it also looks like a clock but has
a plus sign on its lower-left instead of lower-right]). This will create a new calibration. Now type 13.0 in
the Max Divergence Time edit box. Leave the Min Divergence Time edit box blank. Click
the Finished button to complete step 4.
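The two constraints entered in Step 4 can be thought of as simple records with an optional minimum and an optional maximum time. The Python sketch below only illustrates that idea. The Calibration class, its field names, and the validity rule are hypothetical; they are not part of MEGA and do not describe any MEGA file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Calibration:
    """Hypothetical record for one divergence time constraint."""
    taxon_a: str
    taxon_b: str
    min_time: Optional[float] = None  # None means the edit box was left blank
    max_time: Optional[float] = None

    def is_valid(self) -> bool:
        """Assumed rule: at least one bound, no negative bound, min <= max."""
        bounds = [b for b in (self.min_time, self.max_time) if b is not None]
        if not bounds or any(b < 0 for b in bounds):
            return False
        if self.min_time is not None and self.max_time is not None:
            return self.min_time <= self.max_time
        return True


# The two constraints from Step 4 of the tutorial.
calibrations = [
    Calibration("chimpanzee", "bonobo", min_time=1.2, max_time=5.0),
    Calibration("orangutan", "sumatran", max_time=13.0),
]

for c in calibrations:
    print(c.taxon_a, c.taxon_b, c.is_valid())
```

The second record has only a maximum time, which mirrors leaving the Min Divergence Time edit box blank.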
Step 5: Next, we can set several analysis settings, such as the substitution model and the treatment of missing
data. Back in the Timetree Wizard window, click the Set Analysis Options… button to open
the Analysis Preferences dialog. Click the Save button to use the default settings.
Step 6: Finally, in the Timetree Wizard window, click the Execute button. Progress will be displayed as the
analysis runs. When the analysis completes, the Tree Explorer window will return and display the time tree.

Viewing the results


In the Tree Explorer window, the calculated timetree will be displayed with absolute times of divergence
for all branching points in the tree shown. Blue diamonds indicate those nodes which were used to calibrate
the tree. To display node height error bars, click View | Show/ Hide | Node Height Error
Bars (if branch lengths are also shown, you can hide them by clicking View | Show/Hide | Branch
Lengths).
Select File | Export Current Tree (Timetree) and MEGA’s text editor will be displayed with a description
of the tree in tabular format.
Go back to the Tree Explorer window and select View | Show/Hide | Node Ids from the main menu. Now
the divergence times are no longer shown but Node Ids are shown. These correspond to the Node Ids in
the tabular description of the timetree in the text editor.
The first column in the text editor contains node labels. It includes the label specified in the Calibration Editor as
well as a second label that was read from the mtCDNA.nwk file. Open the mtCDNA.nwk file using
the text editor and find this node label.
Select View | Show/Hide | Data Coverage and the data coverage for each internal node in the tree will be
displayed.

Inferring Gene Duplications

This example shows how to identify gene duplications (and optionally speciation events) in MEGA. For this
analysis, MEGA uses a Gene Duplication Wizard window which will walk you through the necessary
steps. The data files used in this example can be found in the MEGA/Examples folder (The default location
for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac
users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select User Tree | Find Gene Duplications. The Gene Duplication
Wizard window, which outlines the 6 steps for identifying gene duplications in MEGA, will be displayed.
Step 1: First, we will load a gene tree file. In the Gene Duplications Wizard window, click
the Load Gene Tree... button and then using the file open dialog, find and select the
“gene_tree.nwk” tree file in the MEGA\Examples directory. After the tree file is parsed by MEGA, the Map
Species To Taxa action in step 2 will become enabled.

23
Step 2: Second, we will provide species names for each taxon in the gene tree. Click the Map Species
Names… button and the species mapping dialog will be displayed. Species names could be mapped manually
using the grid displayed in this dialog, but we will load the names from a text file that specifies the mapping
as taxon_name=species_name for each taxon in the gene tree. Click File | Import and then find
the “taxa_to_species_map.txt” file. Once MEGA loads the file, the grid will be populated with species
names for each taxon. Click the Save button to complete this step and then step 3 will become enabled.
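The mapping file used in Step 2 is plain text with one taxon_name=species_name entry per taxon. As a minimal sketch of how such a file could be checked before importing it, the hypothetical Python helper below parses that layout into a dictionary. The function name and the sample entries are made up for illustration and are not taken from MEGA or from the example file.

```python
def parse_species_map(text):
    """Parse lines of the form taxon_name=species_name into a dict."""
    mapping = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if "=" not in line:
            raise ValueError(f"line {line_number}: expected taxon_name=species_name")
        taxon, species = line.split("=", 1)
        mapping[taxon.strip()] = species.strip()
    return mapping


# Made-up entries for illustration only.
example = "geneA_human=human\ngeneA_mouse=mouse\ngeneB_human=human\n"
print(parse_species_map(example))
```

Several taxa may map to the same species, which is expected when a gene tree contains duplicated genes.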
Step 3: Next, we can optionally load a trusted species tree file. Click the Load Species Tree… button
and then using the file open dialog, find and select the “species_tree.nwk” file in
the MEGA\Examples directory. After the species tree file is parsed by MEGA, the Gene Duplication
Wizard will jump to Step 5. This is because the tree in the “gene_tree.nwk” file is already rooted so we
don’t need to specify the root to MEGA.
Step 4: We skip this step for brevity (but don’t worry, it is done exactly as in Step 5). Note: if our gene tree
were not rooted, we could optionally skip this step. In that case, MEGA would execute the analysis with all
possible placements of the root and keep the result(s) that minimize the number of gene duplications found.
Step 5: Next, we must specify the placement of the root for the species tree as this is required for the
analysis. Click the Set Species Tree Root… button. The species tree will be displayed in Tree
Explorer window and the cursor will be adorned with the root placement tool icon. Click on
the branch to “puffer fish” in the tree and then click the Finished button on the toolbar at the top of the
window. MEGA will set the placement of the root internally and advance to the last step.
Step 6: Finally, in the Gene Duplications Wizard window, click the Launch Analysis button. Progress will
be displayed as the analysis runs. When the analysis completes, the Tree Explorer window will return and
display the gene tree.

Viewing the results


In the Tree Explorer window, the gene tree will be displayed with gene duplications and speciation events
shown. Closed blue diamonds indicate those nodes which represent gene duplication events. Open red
diamonds indicate speciation events. To display species names instead of taxa names, click View | Show/Hide
| Species Names (you can change back to taxa names by clicking View | Show/Hide
| Taxa Names). You can toggle the display of markers for gene duplications and speciation events by
clicking View | Show/Hide | Gene Duplication Markers (or Speciation Markers). You can also
traverse gene duplications or speciations throughout the tree by clicking Search
| Gene Duplication/Speciation Events.

Running in Command-Line Mode


MEGA is distributed with two user interfaces: a full graphical user interface (GUI) and a command-line interface
(MEGA-CC). The command-line interface requires a special input file called a MEGA Analysis Options (.mao) file
which specifies the analysis to run as well as the analysis options to use. This .mao file can only be created by using the
GUI. This example will show how to generate a .mao file with the graphical interface and then use that file to
analyze a data set with MEGA-CC.

Generating the MEGA Analysis Options File


On the main MEGA window, click the Prototype button, which is near the bottom right corner of the window, to
switch MEGA to the prototyping mode. A dialog will be shown that prompts for the data type to be used. In this
dialog, select Nucleotide and click OK.
Select Phylogeny | Construct/Test Neighbor-Joining Tree from the main MEGA window’s launch bar.
The Analysis Preferences window will be shown. We will use the default options so in this dialog, just click
the Save Settings button.
When prompted to save the .mao file, navigate to your Documents\MEGA X\Examples folder and save the file as
infer_NJ_nucleotide.mao.
Close MEGA X.
Launching MEGA-CC from a command shell
Open a command shell and use the cd command to change to your Documents\MEGA X\Examples directory.
In the command shell, execute the following command:
megacc -a infer_NJ_nucleotide.mao -d Drosophila_Adh.meg -o demo
MEGA-CC will run the Phylogeny | Construct/Test Neighbor-Joining Tree analysis using the
Drosophila_Adh.meg sequence alignment (located in the Examples directory) and output two files - a newick file
(demo.nwk) with the newly created phylogeny and a text file (demo_summary.txt) that contains a summary of
the analysis. The same steps are also used to execute any of the other analyses that are available in megacc. To
see the options that are available when using megacc, enter the following command in a command shell:
megacc -h
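If you prefer to launch the same analysis from a script, the command above can be assembled as an argument list. The following is a minimal Python sketch. The helper name build_megacc_command is hypothetical, and the only megacc options it uses are the -a, -d, and -o options shown in this example.

```python
import shlex


def build_megacc_command(mao_file, data_file, output, executable="megacc"):
    """Return the argument list for the megacc call shown in the tutorial."""
    return [executable, "-a", mao_file, "-d", data_file, "-o", output]


cmd = build_megacc_command("infer_NJ_nucleotide.mao", "Drosophila_Adh.meg", "demo")
print(shlex.join(cmd))

# To actually run the analysis (requires megacc to be installed and on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems with folder names that contain spaces, such as MEGA X.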

MEGA-CC Overview
Description
MEGA-CC is the command-line interface for using MEGA and it is included with the graphical interface (GUI). Most
of the calculations that are available in the graphical interface (MEGA-GUI) are also available in MEGA-CC. However,
all calculation results are saved to files on your computer instead of being displayed using graphical tools. With MEGA-
CC, you can batch process data files, automate calculation workflows, and integrate MEGA into analysis pipelines.

MEGA-CC Input Data Files


In order to run MEGA-CC, a minimum of two input files are required - a MEGA Analysis Options (.mao) file and one
or more data files that are to be analyzed:

• MEGA Analysis Options file
    • Specifies the calculation and desired settings.
    • Created using MEGA-GUI.
    • Has a .mao file extension.
• Data file (one or more of the following)
    • Multiple sequence alignment in MEGA or FASTA format.
    • Distance matrix in MEGA format.
    • Unaligned sequences in FASTA format (for sequence alignment only).
• Newick tree file (required for some analyses)
• Calibrations file (optional; used for timetree analysis)
• Groups file (required for timetree analysis, optional for some other analyses)

MEGA-CC Output Files


All results produced by MEGA-CC are written to files on your computer:

• In general, two kinds of output files are produced:
    1. Calculation-specific results file (Newick file, distance matrix, …).
    2. A summary file with additional info (likelihood, SBL, …).
• Some analyses produce additional output (bootstrap consensus tree, csv files, etc.).
• Output directory/filename
    1. Default is the same location as the input data file.
    2. Specify an output directory and/or file name using the -o option.
    3. If no output filename is specified, MEGA-CC will assign a unique name.
• Errors/warnings
    1. If MEGA-CC produces any errors or warnings, they will be logged in the summary file.

Generating a MEGA Analysis Options file

1. Set MEGA-GUI to Prototype mode by clicking the Prototype button on the main form.
2. Specify the data type that will be used by selecting it from a drop-down list.
3. Select an analysis to run from one of the menus on the main form.
4. Select analysis options in the Analysis Preferences dialog.
5. Click the Save Settings button and save the .mao file to your computer (most likely in the same directory as
the data files to be analyzed).

Running MEGA-CC
There are multiple ways in which MEGA-CC can be used:

• Easiest to run using command-line or batch scripts:
    • megacc -a settings.mao -d alignment.meg -o outFile
• Can also be run using custom scripts (Perl, Python, …):
    • exec('megacc -a options.mao -d alignment.meg -o outFile');
• Integrated File Iterator system can process multiple files without the need for using scripts
  (see Demo2 below)
• In addition, other applications can launch MEGA-CC:
    • status = CreateProcess("path/to/megacc…");
• To see a list of available command options, call megacc from a command-line prompt with
  the -h flag (Unix users can view the man page).
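As one possible form of the custom-script approach, the Python sketch below builds one megacc call per data file in a folder. The function name batch_commands, the folder name, and the *.meg pattern are assumptions made for this illustration. It uses only the -a, -d, and -o options shown above and is not a replacement for the Integrated File Iterator system.

```python
from pathlib import Path


def batch_commands(data_dir, mao_file, pattern="*.meg", executable="megacc"):
    """Build one megacc argument list per matching data file in data_dir."""
    commands = []
    for data_file in sorted(Path(data_dir).glob(pattern)):
        # Name the output after the data file, e.g. dir/alignment.meg -> dir/alignment
        output = data_file.with_suffix("")
        commands.append(
            [executable, "-a", str(mao_file), "-d", str(data_file), "-o", str(output)]
        )
    return commands


# To run the batch (requires megacc to be installed and on the PATH):
#   import subprocess
#   for cmd in batch_commands("data", "settings.mao"):
#       subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print and review the list before starting a long batch.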

See Also: A quick tutorial for using MEGA-CC.

MEGA-CC Quick Start Tutorial


Example Data Files
For this tutorial, we will use example data files that are included with MEGA. When MEGA-GUI is launched for the
first time, it will copy these example files to the Documents folder on your computer:
On Windows: %HOMEPATH%\Documents\MEGA X\Examples
On *nix: ~/Documents/MEGA X/Examples

Demo 1 - Computing a Timetree Using the Reltime Method


Step 1 - Launch MEGA X and set to Prototype mode
Open MEGA X the same way you open other applications on your operating system.
In the main MEGA X window, click the Prototype button to enable the generation of .mao files.

When prompted for the input data type to be used, select Nucleotide (Non-coding) from the drop down list

From the Clocks menu on the main form, select Compute Timetree | Reltime-ML

In the Analysis Preferences Dialog, accept the default settings and click the Save Settings... button. When
prompted for a location to save the .mao file, save it in the MEGA X\Examples folder as reltime_ml_nucleotide.mao.

The Reltime analysis requires that we specify an outgroup. Using a text editor, create a text file named
outgroup.txt and save it in the MEGA X\Examples directory. In the text file, add a single line (gibbon=outgroup),
which specifies that in our input phylogeny, the outgroup consists of a single taxon named gibbon.
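The outgroup file can also be produced from a script instead of a text editor. The sketch below is a minimal Python illustration. The helper name groups_file_text is hypothetical, and the only layout it assumes is the taxon=group line shown above (gibbon=outgroup).

```python
def groups_file_text(groups):
    """Return file contents with one taxon=group line per entry."""
    return "".join(f"{taxon}={group}\n" for taxon, group in groups.items())


text = groups_file_text({"gibbon": "outgroup"})
print(text, end="")

# To save it next to the example data files:
#   from pathlib import Path
#   Path("outgroup.txt").write_text(text)
```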

Open a command prompt and navigate to the MEGA X\Examples folder. Launch MEGA-CC by executing the
following command:

megacc -a reltime_ml_nucleotide.mao -d mtCDNA.meg -t mtCDNA.nwk -g outgroup.txt -c mtCDNACalibration.txt -o demo1

The Reltime analysis will run and progress updates will be displayed in the command prompt window

The analysis will produce several output files in the MEGA X\Examples folder:
demo1_exactTimes.nwk
This Newick file gives the timetree scaled according to the estimated divergence times.

demo1_relTimes.nwk
This Newick file gives the timetree scaled according to the estimated relative divergence times.

demo1_nexus.tre
This NEXUS file outputs the timetree in NEXUS format and includes additional information such as
divergence time confidence intervals (tip: open this file in the FigTree application for advanced
visualization capabilities).

demo1.txt
This text file gives a more detailed representation of the timetree, including relative times, exact times,
evolutionary rates, and divergence time std errors.

demo1_summary.txt
This file gives analysis information such as the log likelihood value of the Maximum Likelihood tree, ts/tv
ratio, etc...
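When Demo 1 is run from a script, it can be useful to confirm that the files listed above were written. The following Python sketch checks for them. The helper names are hypothetical, and the file name suffixes are simply those listed above for the output name demo1; other analyses or settings may produce a different set of files.

```python
from pathlib import Path

# Suffixes of the output files listed above for an output name of "demo1".
RELTIME_OUTPUT_SUFFIXES = [
    "_exactTimes.nwk",
    "_relTimes.nwk",
    "_nexus.tre",
    ".txt",
    "_summary.txt",
]


def expected_outputs(output_name, directory="."):
    """Return the paths of the output files expected for this output name."""
    return [Path(directory) / f"{output_name}{s}" for s in RELTIME_OUTPUT_SUFFIXES]


def missing_outputs(output_name, directory="."):
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_outputs(output_name, directory) if not p.exists()]


for path in expected_outputs("demo1"):
    print(path.name)
```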

PART-II: ASSEMBLING DATA FOR ANALYSIS

Trace Data File Viewer/Editor


Using this function, you can view and edit trace data produced by an automated DNA sequencer
in ABI and Staden file formats. The sequences displayed can be added directly into the Alignment Explorer or
sent to the Web Browser for conducting BLAST searches.
A brief description of various functions available in the Trace Data File Viewer/Editor is as follows:
Data menu
Open File in New Window: Launches a new instance to view/edit another file.
Open File: Allows you to select another file to view/edit in the current window.

Save File: Save the current data to a file in Staden format.
Print: Prints the current trace data, excluding all masked regions.
Add to Alignment Explorer: DNA sequence data, excluding all masked regions, is sent to the Alignment
Explorer and appears as a new sequence at the end of the current alignment.
Export FASTA File: Save the active sequence data to a FASTA formatted text file.
Exit: Closes the current window.
Edit menu
Undo: Use this command to undo one or more previous actions.
Copy: This menu provides options to (1) copy DNA sequences in FASTA or plain text format to the
clipboard and (2) copy the exact portion of the trace image that is currently displayed to the clipboard
in the Windows Enhanced Meta File format. For FASTA format copying, both the sequence name and the
DNA data will be copied, excluding the masked regions. To copy only the selected portion of the sequence,
use the plain text copy command (if nothing is selected, the plain text command will copy the entire
sequence, except for the masked regions).
Mask Upstream: Mask or unmask region to the left (upstream) of the cursor.
Mask Downstream: Mask or unmask region to the right (downstream) of the cursor.
Reverse Complement: Reverse complements the entire sequence.
Search menu
Find: Finds a specified query sequence.
Find Next: Finds the next occurrence of the query sequence. To specify the query sequence, first use
the Find menu command.
Find Previous: Finds the previous occurrence of a query sequence. To specify the query sequence, first use
the Find command.
Next N: Go to the next indeterminate (N) nucleotide.
Find in File: This command searches another file, which you specify, for the selected sequence in the current
window. It can be used when you are assembling sequence subclones to build a contig.
Do BLAST Search: Launch web browser to BLAST the currently selected sequence. If nothing is selected,
the entire sequence, excluding the masked regions, will be used.

Web Browser and Data Miner


MEGA contains a fully functional Web Browser to assist users in sequence data retrieval and web
exploration. The most important feature that differentiates this web browser from other browsers
(e.g., Firefox, Chrome or Internet Explorer) is the Add to Alignment button. Pressing this button causes
the MEGA Web Browser to extract sequence data from the currently displayed web page and send it to
the Alignment Explorer’s alignment grid, where it will be inserted as new sequences. At present,
the MEGA web browser can interpret data displayed in FASTA or GenBank format at the NCBI website.

NCBI BLAST
To import from NCBI’s BLAST, the sequences must be displayed on the screen when you press the ‘Add
to Alignment’ button. From the ‘descriptions’ section on the initial results page, you can checkmark the
sequences you wish to retrieve, then click the link labeled ‘GenBank’ at the top of the section.

This will display a list of the selected sequences. On this page, at the upper left corner there will be a link
labeled ‘Display Settings’. Select either ‘GenBank (full)’ or ‘FASTA (text)’, and click the Apply button.

Once the page has finished loading, click the ‘Add to Alignment’ button.

Furthermore, the MEGA web browser provides an interface for web searching that is oriented toward exploring
genomics databases. (In fact, this is almost the same functionality as in the most recent versions of Internet
Explorer.)

Back: This causes the web browser window to navigate back to the web location found before the current site in the
explorer location history. Some webpages don’t allow you to go backwards and will display an error. This behavior is
due to the website and not the browser and cannot be fixed.
Forward: This causes the web browser window to navigate forward to the web location found after the current site in
the explorer location history.
Reload: This causes the web browser to reload the current web location.
Add to Alignment: This causes the web browser to extract sequence data from the current web page and send it to
the Alignment Explorer’s alignment grid as new sequence rows. If the web explorer is unable to find properly formatted
sequence data in the current web page, a warning box will appear.
Address Field: The web location, or address field, is located in the second toolbar. This field contains the URL of the
current web location as well as a pull-down list of previously visited URLs. If a new URL is entered into the box and
the Enter key is pressed, the web explorer will attempt to navigate to the entered URL.

There are a number of menus in the web browser, including Data, Edit, View, Navigate, and Help. These
menus provide access to routine functionalities, which are self-explanatory in use.

TEXT EDITOR AND UTILITY

Text Editor

MEGA includes a Text File Editor, which is useful for creating and editing ASCII text files. It is invoked
automatically by MEGA if the input data file processing modules detect errors in the data file format. In
this case, you should make appropriate changes and save the data file.
The text editor is straightforward if you are familiar with programs like Notepad. Click on the section you
wish to change, type in the new text, or select text to cut, copy or paste. Only the display font can be used
in a document. You can have as many text editor windows open at one time as you like, and you may close
them independently. However, if you have a file open in the Text Editor, you should save it and close
the Text Editor window before trying to use that data file for analysis in MEGA. Otherwise, MEGA may not
have the most up-to-date version of the data.
The Text File Editor and Format Converter is a sophisticated tool with numerous special capabilities that
include:
• Large files – The ability to operate on files of virtually unlimited size and line lengths.
• General purpose – Used to view/edit any ASCII text file.
• Undo/Redo – The availability of an unlimited depth of undo/redo options.
• Search/Replace – Searches for and does block replacements for arbitrary strings.
• Clipboard – Supports familiar clipboard cut, copy, and paste operations.
• Normal and Column blocks – Supports regular contiguous line blocks and columnar blocks. This
is quite useful while manually aligning sequences in the Text Editor.
• Drag/Drop – Moves text with the familiar cut and paste operations or you can select the text and
then move it with the mouse.
• Printing – Prints the contents of the edit file.

The Text Editor contains a menu bar, a toolbar, and a status bar.
The Menu bar
Menu            Description
File menu       The File Menu contains the functions that are most commonly used to open, save,
                rename, print, and close files. (Although there is no separate “rename” function
                available, you can rename a file by choosing the Save As… menu item and giving the
                file a different name before you save it.)
Edit menu       The Edit Menu contains functions that are commonly used to manipulate blocks of text.
                Many of the edit menu items interact with the Windows Clipboard, which is a hidden
                window that allows various selections to be copied and pasted across documents and
                applications.
Search menu     The Search Menu has several functions that allow you to perform searches and
                replacements of text strings. You can also jump directly to a specific line number in the
                file.
Display menu    The Display Menu contains functions that affect the visual display of files in the edit
                windows.
Utilities menu  The Utilities Menu contains several functions that make this editor especially useful for
                working with files containing molecular sequence data (note that the MEGA editor does
                not try to understand the contained data; it simply operates on the text, assuming that the
                user knows what (s)he is doing).
Toolbar
The Toolbar contains shortcuts to some frequently used menu commands.
Status Bar
The Status bar is positioned at the bottom of the editor window. It shows the position of the cursor (line
number and position in the line), whether the file has been edited, and the status of some keyboard keys
(CAPS, NUM, and SCROLL lock).

Hotkeys and Shortcut keys


Many menu items have a hotkey and/or a shortcut key. These are special key combinations that are helpful
for people who are more comfortable using a keyboard than the mouse. Hotkeys are identified by an
underlined character in the name of the menu item, e.g., the F in “File” or the N in “New”. These allow you to hold down the
Alt-key, which is usually found next to the space bar on the keyboard, then hit the underlined letter to
produce the same action as if you clicked that name with the mouse. We show this using the notation
<Alt>+key – e.g., the hotkey for the file menu item is shown as <Alt>+F. Be sure that you depress both
keys together, holding the <Alt> key down a little bit longer than the letter key. (Some people try hitting
both keys simultaneously, as if they’re hitting two keys on a piano keyboard. Quite often, this approach
does not produce the desired results.)
For instance, you could create a new file by clicking the mouse on the “File” menu item, then clicking
on the “New” item beneath it. Using hotkeys, you could type <Alt>+F followed by <Alt>+N. Or, more
simply, while you’re holding down the <Alt> key, hit the ‘F’ key followed by the ‘N’ key, then release the
<Alt> key.
You might notice that several menu items, e.g., the New Item on the File menu, show something to
the right that looks like ‘Ctrl+N’. This is called a Shortcut key sequence. Whereas executing a command
with hotkeys often requires several keystrokes, shortcut keys can do the same thing with just one keystroke.
Shortcut keys work the same as hotkeys, using the <Ctrl> key instead of the <Alt> key. To create a new
file, for example, you can hold down the <Ctrl> key and hit the ‘N’ key, which is shown as <Ctrl>+N here.
(In the menus, this appears simply as ‘Ctrl+N’.)
Not all menu items have associated shortcut keys because there are only 26 shortcut keys, one for
each letter of the alphabet. Hotkeys, in contrast, are localized to each menu and submenu. For hotkeys to
work, the menu item must be visible whereas shortcut keys work at any time. For instance, if you are typing
data into a text file and want to create a note in a new window, you may simply hit the shortcut key sequence,
<Ctrl>+N to generate a new window. After you type the note, you can hit <Ctrl>+S to save it, give it a file
name, and hit the Enter key; then you can hit the <Alt>+F+C hotkey sequence to
close the file (there is no shortcut key for closing a file).

Format Selected Sequence


Utilities | Format Selected Sequence
This submenu presents four menu items that offer some common ways of reformatting text.
Merge Multiple Lines: This is used to merge several separate lines into one long (very wide) line.
Remove Spaces/Digits: This is used to remove spaces and digits from a genetic sequence.
Insert Spaces Every 3: This is used to break the selected text into three-character chunks (e.g., codons). Note that it
does not remove any already existing spaces.
Insert Spaces Every 10: This is used to break the selected text into ten-character chunks.
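To make the effect of these four commands concrete, the Python sketch below imitates them on a small block of text. It is an illustration only and not MEGA’s implementation. The function names are hypothetical, and it assumes that any spaces already present in the text are kept and counted like other characters when inserting new spaces.

```python
import re


def merge_lines(text):
    """Join separate lines into one long line."""
    return "".join(text.splitlines())


def remove_spaces_and_digits(text):
    """Drop space characters and digits, keeping everything else."""
    return re.sub(r"[ 0-9]", "", text)


def insert_spaces_every(text, n):
    """Insert a space after every n characters; existing spaces are kept."""
    return " ".join(text[i:i + n] for i in range(0, len(text), n))


raw = "1 ATGGCC\n7 AAATAG"
merged = merge_lines(raw)                    # "1 ATGGCC7 AAATAG"
cleaned = remove_spaces_and_digits(merged)   # "ATGGCCAAATAG"
print(insert_spaces_every(cleaned, 3))       # prints "ATG GCC AAA TAG"
print(insert_spaces_every(cleaned, 10))      # prints "ATGGCCAAAT AG"
```

Removing spaces and digits first, as in this example, gives chunks that line up with codon boundaries.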

Reverse Complement
Utilities | Reverse Complement
This item reverses the order of characters in the selected block and then replaces each character by its complement.
Only A, T, U, C, and G are complemented; the rest of the characters are left as they are. Please use it carefully
as MEGA does not validate whether the characters in the selected block are nucleotides.
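The behavior described above can be sketched in a few lines of Python. This is an illustration rather than MEGA’s own code. Mapping A to T (rather than to U) and preserving letter case are assumptions of the sketch; characters other than A, T, U, C, and G are passed through unchanged, as the text describes.

```python
# Complement table for the five bases named in the text. Mapping A to T (rather
# than U) and handling lowercase letters are assumptions of this sketch.
COMPLEMENT = {
    "A": "T", "T": "A", "U": "A", "C": "G", "G": "C",
    "a": "t", "t": "a", "u": "a", "c": "g", "g": "c",
}


def reverse_complement(block):
    """Reverse the block, then complement each recognized base."""
    return "".join(COMPLEMENT.get(ch, ch) for ch in reversed(block))


print(reverse_complement("ATGCN-"))  # prints "-NGCAT"
```

Because unrecognized characters are only reversed, applying the function to text that is not a nucleotide sequence silently produces meaningless output, which is why the selection should be checked first.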

Convert to MEGA Format


Utilities | Convert to MEGA Format
This item converts the sequence data in the current edit window, or in a selected file, into a MEGA format file. It
brings up a dialog box, which allows you to choose the file and/or the format for this purpose. MEGA converts the
data file and displays the converted data in the editor.
Files written in a number of popular data formats can be converted into MEGA format. MEGA supports the
conversion of CLUSTAL, NEXUS (PAUP, MacClade), PHYLIP, GCG, FASTA, PIR, NBRF, MSF, IG, and XML
formats. Details about how MEGA reads and converts these file formats are given in the section Importing Data from
Other Formats.

BUILDING SEQUENCE ALIGNMENTS

Introduction to Alignment Explorer


The Alignment Explorer provides options to (1) view and manually edit alignments and (2) generate
alignments using a built-in ClustalW implementation and the MUSCLE program (for the complete sequence
or data in any rectangular region). The Alignment Explorer also provides tools for exploring web-based
databases (e.g., NCBI Query and BLAST searches) and retrieving desired sequence data directly into the
current alignment.
The Alignment Explorer has the following menus in its main
menu: Data, Edit, Search, Alignment, Display, Web, Sequencer, and Help. In addition, there are Toolbars that
provide quick access to many Alignment Explorer functions. The main Alignment Explorer window contains
up to two alignment grids.
For amino acid input sequence data, the Alignment Explorer provides only one view. However, it offers two
views of DNA sequence data: the DNA Sequences grid and the Translated Protein Sequences grid. These two
views are presented as alignment grids in two tabs, with each grid displaying the sequence data for the current
alignment. Each row represents a single sequence and each column represents a site. A “*” character is used
to indicate site columns that exhibit consensus across all sequences. An entire sequence may be selected by
clicking on the gray sequence label cell found to the left of the sequence data. An entire site may be selected
by clicking on the gray cell found above the site column. The alignment grid has the ability to assign a unique
color to each unique nucleotide or amino acid and it can display a background color for each cell in the grid.
This behavior can be controlled from the Display menu item found in the main menu. Please note that when
the ClustalW (and MUSCLE) alignment algorithms are initiated, they will only align the sites currently
selected in the alignment grids. Multiple sites may be selected by clicking and then dragging the mouse within
the grid. Note that all of the manual or automatic alignment procedures carried out in the Protein Sequences
grid will be imposed on the corresponding DNA sequences as soon as you flip to the DNA sequence grid.
Even more importantly, the Alignment Explorer provides unlimited UNDO capabilities.
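The “*” consensus marker described above can be illustrated with a short Python sketch. It is not the Alignment Explorer’s code. The function name is hypothetical, and the sketch assumes that a column is in consensus when every sequence has exactly the same character at that site (a column of identical gap characters would also be marked).

```python
def consensus_marks(sequences):
    """Return a string with '*' at columns where all sequences agree."""
    if not sequences:
        return ""
    length = min(len(s) for s in sequences)  # ignore ragged ends
    marks = []
    for i in range(length):
        column = {s[i] for s in sequences}
        marks.append("*" if len(column) == 1 else " ")
    return "".join(marks)


alignment = ["ATGCC", "ATGAC", "ATGTC"]
for row in alignment:
    print(row)
print(consensus_marks(alignment))  # prints "*** *"
```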

You may adjust the width of the sequence name column by clicking on the line which separates the sequence
names column and the start of the data column and dragging.

Aligning Sequences

In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this tutorial
can be found in the MEGA\Examples\ folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.

Aligning Sequences by ClustalW


You can create a multiple sequence alignment in MEGA using either the ClustalW or Muscle algorithms.
Here we align a set of sequences using the ClustalW option.
Example 2.2:
Open the hsp20.fas alignment file (using the instructions above).
Select the Edit | Select All menu command to select all sites for every sequence in the data set.
Select Alignment | Align by ClustalW from the main menu to align the selected sequence data using
the ClustalW algorithm. Click the “Ok” button to accept the default settings for ClustalW.
Once the alignment is complete, save the current alignment session by selecting Data | Save
Session from the main menu. Give the file an appropriate name, such as "hsp20_Test.mas". This will
allow the current alignment session to be restored for future editing.
Exit the Alignment Explorer by selecting Data | Exit Aln Explorer from the main menu.

Aligning Sequences Using Muscle


Here we describe how to create a multiple sequence alignment using the Muscle option.
Example 2.3:
Starting from the main MEGA window, select Align | Edit/Build Alignment from the launch bar.
Select Create a new alignment and then select DNA.
From the Alignment Explorer window, select Data | Open | Retrieve sequences from a file and select
the “Chloroplast_Martin.meg” file from the MEGA/Examples directory.
On the Alignment Explorer main menu, select Edit | Select All.
On the Alignment Explorer launch bar, you will find an icon that looks like a flexing arm. Click on
it and select Align DNA.
Near the bottom of the MUSCLE - AppLink window, you will see a row called Alignment Info. You
can read information about the Muscle program.
Click on the Compute button (accept the default settings). A Progress window will keep you
informed of Muscle alignment status. In this window, you can click on the Command Line Output tab
to see the command-line parameters that were passed to the Muscle program. Note: The analysis
may complete so quickly that you won’t have time to click on this tab or read it. The information in this
tab isn’t essential; it’s just interesting.
When the Muscle program has finished, the aligned sequences will be passed back to MEGA and
displayed in the Alignment Explorer window.
Close the Alignment Explorer by selecting Data | Exit Aln Explorer. Select No when asked if you
would like to save the current alignment session to file.
Obtaining Sequence Data from the Internet (GenBank)
Using MEGA’s integrated web browser you can fetch GenBank sequence data from the NCBI website if
you have an active internet connection.
Example 2.4:
From the main MEGA window, select Align | Edit/Build Alignment from the main menu.
When prompted, select Create New Alignment and click Ok. Select DNA.
Activate MEGA’s integrated browser by selecting Web | Query Genbank from the main menu.
When the NCBI: Nucleotide site is loaded, enter CFS as a search term into the search box at the top
of the screen. Press the Search button.
When the search results are displayed, check the box next to any item(s) you wish
to import into MEGA.
If you have checked more than one box: locate the Display Settings dropdown (located near
the top left hand side of the page directly under the tab headings). Change the value to FASTA
(Text) and click the Apply button. This will output all the sequences you selected as text in
FASTA format.
Press the Add to Alignment button (with the red + sign) located above the web address bar. This will
import the sequences into the Alignment Explorer.
With the data now displayed in the Alignment Explorer, you can close the Web Browser window.
Align the new data using the steps detailed in the previous examples.
Close the Alignment Explorer window by clicking Data | Exit Aln Explorer. Select No when asked
if you would like to save the current alignment session to file.
Note: We have aligned some sequences and they are now ready to be analyzed. Whenever you need to
edit/change your sequence data, you will need to open it in the Alignment Editor and edit or align it there.
Then export it to the MEGA format and open the resulting file.
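
The FASTA text produced in Example 2.4 has a simple layout: each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. As a rough illustration only (this is not MEGA's own parser), a minimal Python sketch for reading such text into label/sequence pairs could look like the following. The function name parse_fasta and the two sample records are made up for this example.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Hypothetical sample data, not taken from GenBank.
example = """>seq1 hypothetical record
ACGTAC
GTTA
>seq2 hypothetical record
ACG
"""
records = parse_fasta(example)
```

Each record keeps its full header line as the label; wrapped sequence lines are joined into a single string.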

Aligning Coding Sequences via Protein Sequences


MEGA provides two convenient methods for aligning coding sequences based on the alignment of protein
sequences. In the first method, you use the Alignment Explorer to load a data
file containing protein-coding sequences. If you click on the Translated Protein Sequences tab you will see
that the protein-coding sequences are automatically translated into their respective protein sequence. With this
tab active select the Alignment|Align by ClustalW menu item or click on the “W” tool bar icon to begin the
alignment of the translated protein sequences. Once the alignment of the translated protein sequences
completes, click on the DNA Sequences tab and you’ll find that Alignment Explorer automatically aligned the
protein-coding sequences according to the aligned translated protein sequences. Any manual adjustments made
to the translated protein sequence alignment will also be reflected in the protein-coding sequence tab.
Using the second method, select the Alignment|Align by ClustalW (Codons) menu item after loading the
sequence data in Alignment Explorer. Optionally, if MEGA detects that active sequence data may be protein-
coding, clicking the “W” tool bar icon will display a drop down menu for selecting either a DNA or Coding
alignment.
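
The idea behind both methods (translate the codons, align the proteins, then put the original codons back) can be sketched in a few lines of Python. This is only an illustration of the principle, assuming the standard genetic code and assuming the aligned protein strings come from an aligner such as ClustalW or MUSCLE; the function names translate and thread_codons are hypothetical and do not correspond to anything inside MEGA.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(dna):
    """Split a coding sequence into complete codons (a trailing partial codon is dropped)."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Translate a coding sequence; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(c.upper(), "X") for c in split_codons(dna))


def thread_codons(aligned_protein, dna):
    """Rebuild a codon alignment: each residue gets its codon, each gap becomes '---'."""
    codons = iter(split_codons(dna))
    return "".join("---" if res == "-" else next(codons) for res in aligned_protein)


dna_a = "ATGGCTAAA"   # translates to MAK
dna_b = "ATGAAA"      # translates to MK
# Suppose the protein aligner returned these two rows (an assumption for the example).
aligned_a, aligned_b = "MAK", "M-K"
codon_a = thread_codons(aligned_a, dna_a)
codon_b = thread_codons(aligned_b, dna_b)
```

Because gaps are always inserted in groups of three nucleotides, the resulting DNA alignment never splits a codon.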

About Muscle
MUSCLE is a program for generating multiple alignments of amino acid and nucleotide sequences. Its speed
and accuracy were compared with those of T-Coffee, MAFFT, and CLUSTALW, and MUSCLE achieved the highest
or joint highest rank in accuracy in all tests. When used without refinement, its accuracy is the same as that of
T-Coffee or MAFFT, and it is the fastest of the three at aligning large sets of sequences.

To learn about MUSCLE please read this paper:
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput,
Nucleic Acids Research 32(5), 1792-97
Read online: http://nar.oxfordjournals.org/cgi/content/full/32/5/1792?ijkey=48Nmt1tta0fMg&keytype=ref

Muscle Options (DNA)


This dialog box contains a set of parameters for running MUSCLE to align DNA sequences. If you wish to
align protein-coding sequences, you must first translate them to amino acids and then
align them (alternatively, select Alignment|Align by MUSCLE (Codons)).
Below are descriptions of some of the parameters you may change. If you do not change them, they will
keep the values you last set.
Presets:
None: Not selecting any presets
Large Alignment: If you have a large number of sequences (a few thousand) or they are very long the default
settings may be too slow for practical use. This sets the max number of iterations to 2. (command parameter:
-maxiters 2)
Fast Speed: Gives the fastest possible speeds for a result. This will compromise on accuracy. (command
parameters: -maxiters 1 -diags)
Refining Alignment: Use this when you are refining an existing alignment to make it better. (command
parameters: -refine)

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Max Memory in MB: MUSCLE by default specifies an upper limit on how much of your computer's memory
it may use (in megabytes) so that it does not consume all of your computer's resources and cause it to run slowly or
become unresponsive. By default, this number is the amount of physical memory your computer has available. You
may be able to increase this number depending on how much virtual memory you have available.
Max Iterations: The maximum number of iterations allowed.
Clustering Method (Iteration 1,2): The clustering method used in the first two iterations.
Cluster Method (Other Iterations): The clustering method used in iterations after the first two.
Max Diagonal Length: Maximum length of the diagonal.
Other Commands: You may enter parameters for MUSCLE which will be appended to the previously
selected parameters.
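
The relationship between the presets, their command parameters, and the Other Commands field can be pictured as simple list concatenation. The sketch below uses only the preset parameters listed above for DNA; the function name muscle_parameters is hypothetical, and how MEGA actually assembles the command line internally is an assumption here.

```python
import shlex

# Preset name -> command parameters, as listed in the DNA options above.
DNA_PRESETS = {
    "None": [],
    "Large Alignment": ["-maxiters", "2"],
    "Fast Speed": ["-maxiters", "1", "-diags"],
    "Refining Alignment": ["-refine"],
}


def muscle_parameters(preset="None", other_commands=""):
    """Return preset parameters followed by any user-supplied extra parameters."""
    # shlex.split handles quoting the way a command shell would.
    return DNA_PRESETS[preset] + shlex.split(other_commands)


# "-quiet" stands in for whatever the user types into Other Commands.
params = muscle_parameters("Fast Speed", "-quiet")
```

The extra parameters come last, matching the description that they are appended to the previously selected parameters.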

MUSCLE Options (Protein)


This dialog box contains a set of parameters for running MUSCLE to align protein sequences. If you wish
to align DNA sequences, you must first un-translate the data back to DNA (assuming you started with DNA) and then align.
Below are descriptions of some of the parameters you may change. If you do not change them, they will
keep the values you last set.

Presets:
None: Not selecting any presets
Large Alignment: If you have a large number of sequences (a few thousand) or they are very long the default
settings may be too slow for practical use. This sets the max number of iterations to 2. (command parameter:
-maxiters 2)

Fast Speed: Gives the fastest possible speeds for a result. This will compromise on accuracy. (command
parameters: -maxiters 1 -diags -sv -distance1 kbit20_3)
Refining Alignment: Use this when you are refining an existing alignment to make it better. (command
parameters: -refine)

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Hydrophobicity Multiplier: Multiplier for gap open/close penalties in hydrophobic regions.
Max Memory in MB: MUSCLE by default specifies an upper limit on how much of your computer's memory
it may use (in megabytes) so that it does not consume all of your computer's resources and cause it to run slowly or
become unresponsive. By default, this number is the amount of physical memory your computer has available. You
may be able to increase this number depending on how much virtual memory you have available.
Max Iterations: The maximum number of iterations allowed.
Clustering Method (Iteration 1,2): The clustering method used in the first two iterations.
Cluster Method (Other Iterations): The clustering method used in iterations after the first two.
Max Diagonal Length: Maximum length of the diagonal.
Other Commands: You may enter parameters for MUSCLE which will be appended to the previously
selected parameters.
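
To see why a larger gap opening penalty makes gaps less frequent while a larger extension penalty makes them shorter, it can help to compute the total gap cost of an aligned sequence by hand. The sketch below assumes one common convention (the opening penalty is charged once per gap and the extension penalty once per gapped position) and treats terminal gaps as free, as described above. The exact bookkeeping inside MUSCLE may differ, and the function name gap_penalty is hypothetical.

```python
import re


def gap_penalty(aligned_seq, gap_open, gap_extend):
    """Total penalty for the internal gaps of one aligned sequence."""
    core = aligned_seq.strip("-")  # terminal gaps are not penalized
    total = 0.0
    for run in re.findall(r"-+", core):
        total += gap_open + gap_extend * len(run)
    return total


# One gap of four positions versus two gaps of two positions each.
one_long_gap = gap_penalty("AC----GT", gap_open=10, gap_extend=0.5)    # 10 + 4 * 0.5
two_short_gaps = gap_penalty("AC--G--T", gap_open=10, gap_extend=0.5)  # 2 * (10 + 2 * 0.5)
```

With these illustrative values the single long gap is much cheaper than two short ones, which is why raising the opening penalty pushes the aligner toward fewer, longer gaps.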

About ClustalW
ClustalW is a widely used system for aligning any number of homologous nucleotide or protein sequences.
For multi-sequence alignments, ClustalW uses progressive alignment methods. In these, the most similar
sequences, that is, those with the best alignment score are aligned first. Then progressively more distant groups
of sequences are aligned until a global alignment is obtained. This heuristic approach is necessary because
finding the global optimal solution is prohibitive in both memory and time requirements. ClustalW performs
very well in practice. The algorithm starts by computing a rough distance matrix between each pair of
sequences based on pairwise sequence alignment scores. These scores are computed using the pairwise
alignment parameters for DNA and protein sequences. Next, the algorithm uses the neighbor-joining
method with midpoint rooting to create a guide tree, which is used to generate a global
alignment (alternatively, a guide tree in Newick format can be provided). The guide tree serves as a rough
template for clades that tend to share insertion and deletion features. This generally provides a close-to-optimal
result, especially when the data set contains sequences with varied degrees of divergence, so the guide tree is
less sensitive to noise.
See:
Thompson J. D., Higgins D. G., Gibson T. J.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res. 22:4673-4680. (1994)
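
The first stage described above (a rough distance for every pair of sequences, derived from pairwise alignment scores) can be sketched as follows. This is a simplified stand-in, not ClustalW's actual scoring: it uses a plain Needleman-Wunsch global alignment with a linear gap cost and made-up score values, and it stops before the guide tree is built. The names nw_score and rough_distances are hypothetical.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def rough_distances(seqs):
    """Map each pair of names to a rough dissimilarity (smaller means more similar)."""
    dist = {}
    for x, y in combinations(list(seqs), 2):
        longest = max(len(seqs[x]), len(seqs[y]))
        dist[(x, y)] = 1 - nw_score(seqs[x], seqs[y]) / longest
    return dist


# Hypothetical toy sequences.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTCCGA"}
dist = rough_distances(seqs)
closest_pair = min(dist, key=dist.get)  # the pair a progressive aligner would join first
```

A progressive aligner would align the closest pair first and then add the more distant sequences in the order suggested by the guide tree.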

ClustalW Options (DNA)


This dialog box displays a single tab containing a set of organized parameters that are used by ClustalW to
align the DNA sequences. If you are aligning protein-coding sequences and have not used the Align
by ClustalW (Codons) menu item, please note that ClustalW will not respect the codon positions and may
insert alignment gaps within codons. For aligning cDNA or sequence data containing codons, we recommend
that you align the translated protein sequences (see Aligning coding sequences via protein sequences).
In this dialog box, you will see the following options:

Parameters for Pairwise Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Parameters for Multiple Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Common Parameters
DNA Weight Matrix: The scores assigned to matches and mismatches (including IUB ambiguity codes).
Transition Weight: Gives transitions a weight between 0 and 1. A weight of zero means that the transitions
are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly-related DNA
sequences, the weight should be near zero; for closely-related sequences, it can be useful to assign a higher
score.
Use Negative Matrix: When enabled, negative weight matrix values will be used if they are found; otherwise the
matrix will be automatically adjusted to all positive values.
Delay Divergent Cutoff (%): Delays the alignment of the most distantly-related sequences until after the
most closely-related sequences have been aligned. The setting shows the percent identity level required to
delay the addition of a sequence. Sequences that are less identical than this level will be aligned later.
Keep Predefined Gaps: When checked, alignment positions in which ANY of the sequences have a gap will
be ignored.
Specify Guide Tree: Browse for and select a guide tree (in Newick format) to be used for the alignment. If
this option is not used, then a Neighbor-Join tree will be created and automatically used as the guide tree.

NOTE: All definitions are derived from the CLUSTALW manual.
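
The effect of the Transition Weight setting can be illustrated with a small scoring function. The sketch assumes a simple linear interpolation between the mismatch score (weight 0) and the match score (weight 1), which agrees with the two end points described above; how ClustalW scales intermediate values internally is not shown here, and the match and mismatch values are placeholders.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True for A<->G or C<->T substitutions."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def pair_score(x, y, transition_weight, match=1.0, mismatch=0.0):
    """Score one aligned pair of nucleotides under a given transition weight."""
    if x == y:
        return match
    if is_transition(x, y):
        # Weight 0 -> scored as a mismatch; weight 1 -> scored as a match.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversions are always mismatches


as_mismatch = pair_score("A", "G", transition_weight=0.0)
as_match = pair_score("A", "G", transition_weight=1.0)
halfway = pair_score("C", "T", transition_weight=0.5)
```

Transversions such as A to C are unaffected by the weight, which is why the setting matters most for closely-related sequences where transitions dominate.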

ClustalW Options (Protein)


This dialog box displays a single tab containing a set of organized parameters that are used by ClustalW to
align Protein sequences.
In this dialog box, you will see the following options:
Parameters for Pairwise Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Parameters for Multiple Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Common Parameters
Protein Weight Matrix: The scores assigned to matches and mismatches (including IUB ambiguity codes).
Residue-specific Penalties: Amino acid specific gap penalties that reduce or increase the gap opening
penalties at each position or sequence in the alignment. For example, positions that are rich in glycine are more
likely to have an adjacent gap than positions that are rich in valine. See the documentation for details.
Hydrophilic Penalties: Used to increase the chances of a gap within a run (5 or more residues) of hydrophilic
amino acids; these are likely to be loop or random coil regions in which gaps are more common.
Gap Separation Distance: Tries to decrease the chances of gaps being too close to each other. Gaps that are
less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it makes
them less frequent, promoting a block-like appearance of the alignment.
Use Negative Matrix: When enabled, negative weight matrix values will be used if they are found; otherwise
the matrix will be automatically adjusted to all positive values.
Delay Divergent Cutoff (%): Delays the alignment of the most distantly-related sequences until after the
alignment of the most closely-related sequences. The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level will be aligned later.
Keep Predefined Gaps: When checked, any alignment positions in which ANY of the sequences have a gap
will be ignored.
Specify Guide Tree: Browse for and select a guide tree (in Newick format) to be used for the alignment. If
this option is not used, then a Neighbor-Join tree will be created and automatically used as the guide tree.

NOTE: All definitions are derived from the CLUSTALW manual.
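
The Delay Divergent Cutoff can be pictured as a percent-identity filter. The sketch below is an interpretation rather than ClustalW's implementation: it assumes that percent identity is measured over columns where neither sequence has a gap, and that a sequence is delayed when its best identity to any other sequence falls below the cutoff. The function names and sample sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def delayed_sequences(aligned, cutoff):
    """Names of sequences whose best identity to any other sequence is below the cutoff."""
    delayed = []
    for name, seq in aligned.items():
        best = max(
            percent_identity(seq, other)
            for other_name, other in aligned.items()
            if other_name != name
        )
        if best < cutoff:
            delayed.append(name)
    return delayed


# Hypothetical pairwise-comparable sequences of equal length.
aligned = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "TTTTACGGGG",
}
late = delayed_sequences(aligned, cutoff=50)
```

Here s1 and s2 are 90% identical to each other, while s3 is only 40% identical to either, so only s3 would be held back at a 50% cutoff.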

BLAST

About BLAST
BLAST is a widely used tool for finding matches to a query sequence within a large sequence database, such
as GenBank. BLAST is designed to look for local alignments, i.e. maximal regions of high similarity between
the query sequence and the database sequences, allowing for insertions and deletions of sites. Although the
optimal solution to this problem is computationally intractable, BLAST uses carefully designed and tested
heuristics that enable it to perform searches very rapidly (often in seconds). For each comparison, BLAST
reports a goodness score and an estimate of the expected number of matches with an equal or higher score
that would be found by chance, given the characteristics of the sequences. When this expected value is very
small, the sequence from the database is considered a “hit” and a likely homologue to the query sequence.
Versions of BLAST are available for protein and DNA sequences and are made accessible in MEGA via
the Web Browser.

See:
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool."
J. Mol. Biol. 215:403-410.
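
The expected value mentioned above is commonly written with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw score, m and n are the query and database lengths, and K and lambda depend on the scoring system. The sketch below simply evaluates that formula; the parameter values are placeholders chosen for illustration, not values reported by any particular BLAST run, and the function name expected_hits is hypothetical.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters (assumptions for this example only).
LAMBDA, K = 0.267, 0.041

weak = expected_hits(40, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
strong = expected_hits(120, query_len=300, db_len=1_000_000, lam=LAMBDA, k=K)
```

The expected value falls exponentially as the score rises and grows in proportion to the database size, which is why the same score is less convincing when searching a larger database.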

Using BLAST Within MEGA


Align | Do BLAST Search
Use this to launch a BLAST search in the MEGA Web Browser. The Web Browser is displayed with the BLAST
facility at the NCBI website.

MENU ITEMS IN THE ALIGNMENT EXPLORER

Toolbar in Alignment Explorer

Basic Functions
This prepares Alignment Builder for a new alignment. Any sequence data currently loaded
into Alignment Builder is discarded.

This activates the Open File dialog window. It is used to send sequence data from a properly
formatted file into Alignment Builder.

This activates the Save Alignment Session dialog window. It may be used to save the current state of
the Alignment Builder into a file so that it may be restored in the future.

This causes nucleotide sequences currently loaded into Alignment Builder to be translated into their
respective amino acid sequences.

Web Browser/Data Explorer Functions


This displays the NCBI BLAST web site in the integrated Web Browser window. If a sequence in the
sequence grid is selected prior to clicking this button, the Web Explorer will auto-fill the BLAST
query window with the selected sequence data.

This displays the default database (GenBank) in the integrated Web Browser window.

This activates the Open Trace File dialog window, which may be used to open and view a sequencer
file. The sequence data from the sequencer file then can be sent into Alignment Explorer.

Alignment Functions
This displays the ClustalW parameters dialog window, which is used to configure ClustalW and
initiate the alignment of the selected sequence data. If you do not select sequence data prior to
clicking this button, a message box will appear asking if you would like to select all of the currently
loaded sequences.

This displays the MUSCLE parameters dialog window, which is used to configure MUSCLE and
initiate the alignment of the selected sequence data. If you do not select sequence data prior to
clicking this button, a message box will appear asking if you would like to select all of the currently
loaded sequences.

This marks or unmarks the currently selected single site in the alignment grid. Each sequence in the
alignment may have only one site marked at a time. Modifications can be made to the alignment by
marking two or more sites and then aligning them using the Align Marked Sites function.

This button aligns marked sites. Two or more sites must be marked in order for this function to have
an effect.

Search Functions
This activates the Find Motif search box. When this box appears, it asks you to enter a motif
sequence (a small subsequence of a larger sequence) as the search term. After the search term is
entered, the Alignment Builder finds each occurrence of the search term and indicates it with yellow
highlighting. For example, if you were to enter the motif “AGA” as the search term, then each
occurrence of “AGA” across all sequences in the sequence grid would be highlighted in yellow.

This searches towards the beginning of the current sequence for the first occurrence of the motif
search term. If no motif search has been performed prior to clicking this button, the Find
Motif search box will appear.

This searches towards the end of the current sequence for the first occurrence of the motif search
term. If no motif search has been performed prior to clicking this button, the Find Motif search box
will appear.

This locates the marked site in the current sequence. If no site has been marked, a warning box will
appear.

Editing Functions
This undoes the last Alignment Builder action.

This copies the current selection to the clipboard. It may be used to copy a single base, a block of
bases, or entire sequences to the clipboard.

This removes the current selection from the Alignment Builder and sends it to the clipboard. This
function can affect a single base, a block of bases, or entire sequences.

This pastes the contents of the clipboard into the Alignment Builder. If the clipboard contains a block
of bases, it will be pasted into the builder starting at the point of the current selection. If the clipboard
contains complete sequences they will be added to the current alignment. For example, if the
contents of a FASTA file were copied to the clipboard from a web browser, it would be pasted
into Alignment Builder as a new sequence in the alignment.

This deletes a block of selected bases from the alignment grid.

This deletes gap-only sites (sites containing a gap across all sequences in the alignment grid) from a
selected block of bases.

Sequence Data Insertion Functions


This creates a new, empty sequence row in the alignment grid. A label and sequence data must be
provided for this new row.

This activates an Open File dialog box that allows for the selection of a sequence data file. Once a
suitable sequence data file is selected, its contents will be imported into Alignment Builder as new
sequence rows in the alignment grid.

Site Number display on the status bar


Site #: The Site # field indicates the site represented by the current selection. If the w/o Gaps radio button is
selected, then the Alignment Builder will disregard the shifting effect of gaps when determining site
numbers. If a block of sites is selected, then this field will contain the site # for the first site in the block.
If an entire sequence is selected, this field will contain the site # for the last site in the sequence.
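
The difference between the two numbering modes can be shown with a short function. This is a sketch under the assumption that site numbers are 1-based and that, without gaps, a site's number is the count of non-gap characters up to and including that column; how MEGA reports a column that is itself a gap is not specified here. The function name site_number is hypothetical.

```python
def site_number(aligned_seq, column, without_gaps=False):
    """1-based site number for a 0-based alignment column."""
    if not without_gaps:
        return column + 1
    # Count only real residues up to and including the selected column.
    return sum(1 for ch in aligned_seq[:column + 1] if ch != "-")


row = "AC--GT"
with_gaps = site_number(row, 4)                    # the G, counting gap columns
no_gaps = site_number(row, 4, without_gaps=True)   # the G, ignoring gap columns
```

In this row the G sits in the fifth alignment column but is only the third residue of the sequence itself.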

Menu in Alignment Explorer


This menu provides access to commands for editing the sequence data in the alignment grid. The commands
are:
Align by ClustalW: This option is used to align the DNA or protein sequence included in the current selection
on the alignment grid. You will be prompted for the alignment parameters (which are context sensitive
for DNA or Protein sequence data) to be used in ClustalW; to accept the parameters, press “OK”. This initiates
the ClustalW alignment system. Alignment Builder then aligns the current selection in the alignment grid using
the accepted parameters.
Align by ClustalW (Codons): This option is used to align (via ClustalW) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally replacing
the amino acids with the original codons.
Align by MUSCLE: This option is used to align the DNA or protein sequence included in the current selection
on the alignment grid. You will be prompted for the alignment parameters (DNA or Protein) to be used in
MUSCLE; to accept the parameters, press “OK”. This initiates the MUSCLE alignment system. Alignment
Builder then aligns the current selection in the alignment grid using the accepted parameters.
Align by MUSCLE (Codons): This option is used to align (via MUSCLE) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally replacing
the amino acids with the original codons.
Mark/Unmark Site: This marks or unmarks a single site in the alignment grid. Each sequence in the
alignment may only have one site marked at a time. Modifications can be made to the alignment by marking
two or more sites and then aligning them using the Align Marked Sites function.
Align Marked Sites: This aligns marked sites. Two or more sites in the alignment must be marked for this
function to have an effect.
Unmark All Sites: This item unmarks all currently marked sites across all sequences in the alignment grid.
Delete Gap-Only Sites: This item deletes gap-only sites (site columns containing gaps across all sequences)
from the alignment grid.
Auto-Fill Gaps: If this item is checked, then the Alignment Builder will ensure that all sequences in the
alignment grid are the same length by padding shorter sequences with gaps at the end.
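
The last two items, Delete Gap-Only Sites and Auto-Fill Gaps, are easy to express in code. The following Python sketch shows the idea on a list of strings; it illustrates the behavior described above and is not MEGA's implementation. The function names are hypothetical.

```python
def auto_fill_gaps(seqs):
    """Pad shorter sequences with gaps at the end so all rows have equal length."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def delete_gap_only_sites(seqs):
    """Remove columns that contain a gap in every sequence (rows must be equal length)."""
    keep = [i for i in range(len(seqs[0])) if any(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


rows = ["AC-GT", "A--G", "AC-G-"]
filled = auto_fill_gaps(rows)
cleaned = delete_gap_only_sites(filled)
```

Padding first guarantees that every row has the same number of columns, so the column-wise check in delete_gap_only_sites is safe.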

Display Menu
This menu provides access to commands that control the display of toolbars in the alignment grid. The
commands in this menu are:
Toolbars: This contains a submenu of the toolbars found in Alignment Explorer. If an item is checked, then
its toolbar will be visible within the Alignment Explorer window.
Columns: This contains a submenu for toggling the display of species names and groups columns. If an item
is checked, then its column will be shown.
Use Colors: If checked, Alignment Explorer displays each unique base using a unique color indicating the
base type.

Background Color: If checked, then Alignment Explorer colors the background of each base with a unique
color that represents the base type.
Toggle Conserved Sites: Toggles on/off the display of background color for sites with a given percent of
conservation.
Font: The Font dialog window can be used to select the font used by Alignment Explorer for displaying the
sequence data in the alignment grid.

Edit Menu
This menu provides access to commands for editing the sequence data in the alignment grid. The commands
in this menu are:
Undo: This undoes the last Alignment Explorer action.
Copy: This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases,
or entire sequences.
Cut: This removes the current selection from the Alignment Explorer and sends it to the clipboard. This
function can affect a single base, a block of bases, or entire sequences.
Paste: This pastes the contents of the clipboard into the Alignment Explorer. If the clipboard contains a block
of bases, they will be pasted into the builder, starting at the point of the current selection. If the clipboard
contains complete sequences, they will be added to the current alignment. For example, if the contents of a
FASTA file are copied from a web browser to the clipboard, they will be pasted into the Alignment Explorer as
a new sequence in the alignment.
Delete: This deletes a block of selected bases from the alignment grid.
Delete Gaps: This deletes gaps from a selected block of bases.
Insert Blank Sequence: This creates a new, empty sequence row in the alignment grid. A label and sequence
data must be provided for this new row.
Insert Sequence From File: This activates an Open File dialog box that allows for the selection of a sequence
data file. Once a suitable sequence data file is selected, its contents will be imported into Alignment Explorer as
new sequence rows in the alignment grid.
Select Site(s): This selects the entire site column for each site within the current selection in the alignment
grid.
Select Sequences: This selects the entire sequence for each site within the current selection in the alignment
grid.
Select All: This selects all of the sites in the alignment grid.
Allow Base Editing: If this item is checked, the base values of cells in the alignment grid can be edited. If it
is not checked, then all bases in the alignment grid are treated as read-only.
Modify All Bases to Uppercase: Changes any bases written in lowercase to uppercase.

Data Menu
This menu provides commands for creating a new alignment, opening/closing sequence data files, saving
alignment sessions to a file, exporting sequence data to a file, changing alignment sequence properties, reverse
complementing sequences in the alignment, and exiting Alignment Explorer. The commands in this menu are:
Create New Alignment: This tells Alignment Explorer to prepare for a new alignment. Any sequence data
currently loaded into Alignment Builder is discarded.
Open: This submenu provides two options: opening an existing sequence alignment session (previously saved
from Alignment Explorer), and reading a text file containing sequences in one of many formats (including
MEGA, PAUP, FASTA, NBRF, etc.). Based on the option you choose, you will be prompted for the file name
that you wish to read.
Reopen: Displays a list of recently opened files that can be activated in Alignment Explorer.
Close: This closes the currently active data in the Alignment Explorer.

Phylogenetic Analysis: Clicking this item will prepare the data in the active sequence alignment for further
analysis in MEGA so that the alignment does not have to be saved to a file on disk and then reopened for
analysis in MEGA.
Save Session: This allows you to save the current sequence alignment to an alignment session. You will be
requested to give a file name to write the data to.
Export Alignment: This allows you to export the current sequence alignment to a file. There are three formats
to choose from: MEGA, FASTA, or PAUP/NEXUS. You will be requested to give a file name to write
the data to.
DNA Sequences: Use this item to specify that the input data is DNA. If DNA is selected, then all sites are
treated as nucleotides. The Translated Protein Sequences tab contains the protein sequences. If the data is
non-coding, then ignore the second tab, as it has no effect on the DNA Sequences tab. However, any changes
you make in the Translated Protein Sequences tab are applied to the DNA Sequences tab. Note that you can
undo these changes by using the Undo button.
Protein Sequences: Use this item to specify that the input data is amino acid sequences. If selected, then all
sites are treated as amino acid residues.
Translate/Untranslate: This item will be available only if protein-coding DNA sequences are present in the
alignment grid. It will translate protein-coding DNA sequences into their respective amino acid sequences
using the selected genetic code table.
Select Genetic Code Table: This displays the Select Genetic Code dialog window, which is used to select the
genetic code table that is used when translating protein-coding DNA sequence data.
Reverse Complement: This becomes available when one or more entire sequence rows are selected. It will update
the selected rows to contain the reverse complement of the originally selected sequence(s).
Exit Aln Explorer: This closes the Alignment Explorer window and returns to the main MEGA application
window. When selected, a message box appears asking if you would like to save the current alignment
session to a file. Then a second message box appears asking if you would like to save the current alignment to
a MEGA file. If the current alignment is saved to a MEGA file, a third message box will appear asking if you
would like to open the saved MEGA file in the main MEGA application.
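
For readers who want to check the result of the Reverse Complement command by hand, the operation itself is straightforward. The Python sketch below handles the four standard bases in either case and leaves every other character (gaps, ambiguity codes) unchanged, which is a simplification; the function name reverse_complement is hypothetical.

```python
# Translation table for the four standard bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(COMPLEMENT)[::-1]


original = "AAGC"
flipped = reverse_complement(original)
restored = reverse_complement(flipped)  # applying it twice gives back the input
```

Applying the function twice returns the original sequence, which is a convenient sanity check.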

Search Menu
This menu allows searching for sequence motifs and marked sites. The commands in this menu are:
Find Motif: This activates the Find Motif search box. When this box appears, it asks you to enter a motif
sequence (a small subsequence of a larger sequence) as the search term. After you enter the search term,
the Alignment Explorer finds each occurrence of it and indicates it with yellow highlighting. For example, if
you enter the motif “AGA” as the search term, then each occurrence of “AGA” across all sequences in the
sequence grid would be highlighted in yellow.
Find Next: This searches for the first occurrence of the motif search term towards the end of the current
sequence. If no motif search has been performed prior to clicking this button, the Find Motif search box will
appear.
Find Previous: This searches towards the beginning of the current sequence for the first occurrence of the motif
search term. If no motif search has been performed prior to clicking this button, the Find Motif search box will
appear.
Find Marked Site: This locates the marked site in the current sequence. If no site has been marked for this
sequence, a warning box will appear.
Highlight Motif: If this item is checked, then all occurrences of the text search term (motif) are highlighted
in the alignment grid.

Sequencer Menu
Edit Sequencer File: This item displays the Open File dialog box used to open a sequencer data file. Once opened, the
sequencer data file is displayed in the Trace Data File Viewer/Editor. This editor allows you to view and edit trace data
produced by an automated DNA sequencer. It reads and edits data in ABI and Staden file formats, and the sequences
displayed can be added directly to the Alignment Explorer or sent to the Web Browser for
conducting BLAST searches.

Web Menu
This menu provides access to commands for querying GenBank and doing a BLAST search, as well as access
to the MEGA web Browser. The commands in this menu are:
Query Gene Banks: This item starts the Web Browser and accesses the NCBI home page
(http://www.ncbi.nlm.nih.gov).
Do BLAST Search: This item starts the Web Browser and accesses the NCBI BLAST query page. If you
select a sequence in the alignment grid prior to selecting this item, the web browser will automatically copy
the selected sequence data into the search field.
Show Browser: This item will show the Web Browser.

CONCATENATE SEQUENCE DATA

Concatenation Utility

MEGA provides a utility for concatenating multiple files containing sequence data into a single sequence alignment.
This tool is used as follows:
• All of the source alignment files that are to be concatenated should be collected and placed together into a
directory/folder on your computer. There should be no other files in this directory and all of these files should
be FASTA formatted files or MEGA formatted files. The data must all be of the same type as well (cannot mix
DNA and amino acid data).
• From the MEGA main form, click Data->Concatenate Sequence Alignments. MEGA will prompt you for the
directory/folder that contains the source alignment files and you should select that directory.
• If MEGA cannot infer the data type contained in the files, MEGA will prompt you for the data type (as well as
special symbols used such as for indels or identical bases).
• MEGA will process the input files in alphabetical order, concatenating sequences that have the same name and
adding a new sequence when a new name is encountered. Wherever needed, MEGA will add missing base
symbols (default is ?) to fill missing data so that sequence data alignment is maintained. For example, if a new
sequence is encountered in the third file processed, missing base symbols (equal to the number of bases from
the first two files) will be prepended to the new sequence.
• Once the concatenation is complete, the data will be imported into the Sequence Data Explorer window (press
F4 or click View->Explore Active Data on the main form to view the alignment).
• From the Sequence Data Explorer, you can export the concatenated alignment to multiple formats by clicking
Data->Export Data...
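
The padding behaviour described in the steps above can be illustrated with a short Python sketch. This is not MEGA's own code: the function name concatenate_alignments and the dictionary-based input are assumptions made for illustration, and the input list is assumed to be already sorted in the order in which the files would be processed.

```python
def concatenate_alignments(alignments, missing="?"):
    """Concatenate alignments given as a list of {name: sequence} dicts.

    Hypothetical helper: the list is assumed to be in processing order
    (alphabetical by file name in the description above).
    """
    combined = {}
    total_length = 0
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        block_length = lengths.pop()
        # A name seen for the first time is padded for all earlier blocks.
        for name in alignment:
            if name not in combined:
                combined[name] = missing * total_length
        # Names absent from this block are padded for this block.
        for name in combined:
            combined[name] += alignment.get(name, missing * block_length)
        total_length += block_length
    return combined


files_in_order = [
    {"human": "ACGT", "mouse": "ACGA"},
    {"human": "TT", "mouse": "TC"},
    {"human": "GGG", "chicken": "GGA"},
]
result = concatenate_alignments(files_in_order)
for name, sequence in result.items():
    print(name, sequence)
```

In this sketch the sequence named chicken first appears in the third block, so it is prepended with six missing-data symbols, while mouse, which is absent from the third block, is padded at the end.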

Part-III: Input Data Types and File Formats


MEGA Input Data Formats:

MEGA Format
For MEGA to read and interpret your data correctly, it should be formatted according to a set of rules. All input data
files are basic ASCII-text files, which may contain DNA sequence, protein sequence, evolutionary distance, or
phylogenetic tree data. Most word processing packages (e.g., Microsoft Word, WordPerfect, Notepad, and WordPad)
allow you to edit and save ASCII text files, which are usually marked with a .TXT extension. After creating the file, you
should change this extension to .MEG, so that you can distinguish between your data files and the other text files. Because
the organizational details vary for different types of data, we discuss the data formats for molecular sequences, distances,
and phylogenetic trees separately. However, there are a number of features that are common to all MEGA data files.

MEGA Saved Sessions


Session saving allows you to save the current state of MEGA to a file that you can open later to resume where
you left off.
The session format is also much more efficient than the text-based MEGA format (.MEG), which results in much
faster save and load times. If you open a session file in a text editor, you will only be able to read the part
that contains the actual data. The rest of the file is binary and records the state of all the options.
Session files have the extension .mdsx.

Errors with a saved session


Problem:
'The file selected is a MEGA session file but appears to have been not fully saved when created'
Reason:
This error occurs when you attempt to open a saved session file that was only partly written when it was
created. This can happen if MEGA is shut down before it has finished writing the file, or because of a bug in
MEGA.

Problem:
'The file selected does not appear to be a Mega Session File, it may be corrupt'
Reason:
This error occurs when MEGA cannot identify the file you are attempting to open as a MEGA saved session
file. Please make sure you opened the right file. If you obtained this file from someone else, please obtain
another copy of the file from them and try again, as the file might have been corrupted in transit.

Problem:
'This Mega Session was created with a newer version of Mega; only settings compatible with this version of Mega
will be restored'
Reason:
This warning appears when you open a saved session created with a newer version of MEGA. Only saved
settings that apply to your version of MEGA will be restored.

Problem:
'This Mega Session file was created with an older version of Mega; only settings that are compatible with this
version of Mega will be restored'
Reason:
This warning appears when you open a saved session created with an older version of MEGA. Only settings
that exist in the saved session will be applied to your version of MEGA.

GENERAL CONVENTIONS
Common Features
The first line must contain the keyword #MEGA to indicate that the data file is in MEGA format. The data file
may contain a succinct description of the data (called Title) included in the file on the second line.
The Title statement is written according to a set of rules and is copied from MEGA to every output file. In the long
run, an informative title will allow you to easily recognize your past work.
The data file may also contain a more descriptive multi-line account of the data in the Description statement, which
is written after the Title statement. The Description statement also is written according to a set of rules. Unlike the
Title statement, the Description statement is not copied from MEGA to every output file.
In addition, the data file may also contain a Format statement, which includes information on the type of data
present in the file and some of its attributes. The Format statement should generally be written after the Title or
the Description statement. Writing a Format statement requires knowledge of the keywords used to identify
different types of data and data attributes.
All taxa names must be written according to a set of rules.
Comments can be written anywhere in the data file and can span multiple lines. They must always be enclosed in
square brackets ([ and ]) and can be nested.

Writing Comments
Comments can be written anywhere in the data file and can span multiple lines. They must always be enclosed in
square brackets ([ and ]) and can be nested.
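
Because comments can be nested, a reader of the file has to track bracket depth rather than search for the first closing bracket. The following Python sketch shows one way to do this. The function name strip_comments is hypothetical, and the sketch assumes that square brackets are used only for comments.

```python
def strip_comments(text):
    """Remove [ ... ] comments, including nested ones, from MEGA-format text."""
    kept = []
    depth = 0  # current nesting level of square brackets
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth == 0:
                raise ValueError("closing bracket without a matching opening bracket")
            depth -= 1
        elif depth == 0:
            # Only characters outside every comment are kept.
            kept.append(ch)
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(kept)


sample = "#mega\n!Title demo [draft [v2] title];\n#one ACGT [first taxon]\n"
cleaned = strip_comments(sample)
print(cleaned)
```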

Keywords
MEGA supports a number of keywords, in addition to MEGA and TITLE, for writing instructions in the format and
command statements. These keywords can be written in any combination of lower- and upper-case letters. For writing
instructions, follow the style given in the examples along with the keyword description for different types of data.

Rules for Taxa Names


Distance matrices as well as sequence data may come from species, populations, or individuals. These evolutionary
entities are designated as OTUs (Operational Taxonomic Units) or taxa. Each taxon must have an identification
tag, i.e., a taxon label. In the input files prepared for use in MEGA, these labels should be written according to the
following conventions:
‘#’ Sign
Every label must be written on a new line, and a '#' sign must precede the label. There are no restrictions on the
length of the labels in the data file, but MEGA will truncate all labels longer than 40 characters. These labels are
not required to be unique, although identical labels may result in ambiguities and should be avoided.

Characters to use in labels
Taxa labels must start with alphanumeric characters (0-9, a-z, and A-Z) or a special character: dash (-), plus (+) or
period (.). After the first character, taxa labels may contain the following additional special characters: underscore
(_), asterisk (*), colon (:), round open and close brackets ( ), vertical line (|), back slash (\), and forward slash (/).
For multiple word labels, an underscore can be used to represent a blank space. All underscores are converted into
blank spaces, and subsequent displays of the labels show this change. For example, E._coli becomes E. coli.
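
The character rules above can be expressed as a regular expression. The sketch below is an illustration only: the names LABEL_PATTERN, is_valid_label and display_label are hypothetical, and it assumes that the characters allowed in the first position are also allowed in later positions. It applies to the label itself, without the leading '#' sign.

```python
import re

# First character: alphanumeric, dash, plus or period.
# Later characters: the same set plus _ * : ( ) | \ /
LABEL_PATTERN = re.compile(r"[0-9A-Za-z+.\-][0-9A-Za-z+.\-_*:()|\\/]*")
MAX_LABEL_LENGTH = 40  # longer labels are truncated


def is_valid_label(label):
    """Return True if the label uses only the permitted characters."""
    return LABEL_PATTERN.fullmatch(label) is not None


def display_label(label):
    """Truncate a valid label to 40 characters and show underscores as blanks."""
    if not is_valid_label(label):
        raise ValueError(f"invalid taxon label: {label!r}")
    return label[:MAX_LABEL_LENGTH].replace("_", " ")


print(display_label("E._coli"))
print(is_valid_label("_coli"))
```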

Rules for Title Statement


A Title statement must be written on the line following the #mega. It always begins with !Title and ends with a
semicolon.
#mega
!Title This is an example title;
A title statement may not occupy more than one line of text. It must not contain a semicolon inside the statement,
although it must contain one at the end of the statement.

Rules for Description Statement


A Description statement is written after the Title statement. It always begins with !Description and ends with a
semicolon.
#mega
!Title This is an example title;
!Description This is detailed information about the data file;
A description statement may occupy multiple lines of text. It must not contain a semicolon inside the statement,
although it must contain one at the end of the statement.

Rules for Format Statement


A format statement contains one or more command statements. A command statement contains a command and a
valid setting keyword (command=keyword format). For example, the command
statement DataType=Nucleotide tells MEGA that nucleotide sequence data is contained in the file. Based on the
DataType setting, different types of keywords are valid.
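
To make the command=keyword structure concrete, the following Python sketch splits a format statement into its command statements. It is a rough illustration under stated assumptions: the statement is assumed to begin with !Format and to end with a semicolon, by analogy with the Title and Description statements, and the function name parse_format_statement is hypothetical. Command names are lower-cased because keywords can be written in any combination of upper- and lower-case letters.

```python
import re


def parse_format_statement(statement):
    """Parse '!Format Cmd=Value ...;' into a dict keyed by lower-case command."""
    body = statement.strip()
    if not body.lower().startswith("!format") or not body.endswith(";"):
        raise ValueError("expected a statement of the form '!Format ... ;'")
    # Drop the leading keyword and the closing semicolon.
    body = body[len("!format"):-1]
    settings = {}
    # White space around '=' is tolerated, as in 'Indel = -'.
    for command, value in re.findall(r"(\w+)\s*=\s*(\S+)", body):
        settings[command.lower()] = value
    return settings


example = "!Format DataType=Nucleotide NSeqs=85 NSites=4592 Indel = - Identical = . Missing = ?;"
settings = parse_format_statement(example)
print(settings)
```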

Keywords for Sequence Data

Keywords for Format Statement (Sequence Data)

Command     Setting               Remark                                             Example

DataType    DNA, RNA,             Specifies the type of data in the file             DataType=DNA
            nucleotide, protein
NSeqs       A count               Number of sequences                                NSeqs=85
NTaxa       A count               Synonymous with NSeqs                              NTaxa=85
NSites      A count               Number of nucleotides or amino acids               Nsites=4592
Property    Exon, Intron,         Specifies whether a domain is protein coding.      Property=cyt_b
            Coding, Noncoding,    Exon and Coding are synonymous, as are Intron
            and End               and Noncoding. End specifies that the domain
                                  with the given name ends at this point.
Indel       single character      Use dash (-) to identify insertions/deletions      Indel = -
                                  in sequence alignments
Identical   single character      Use period (.) to show identity with the first     Identical = .
                                  sequence
MatchChar   single character      Synonymous with the Identical keyword              MatchChar = .
Missing     single character      Use a question mark (?) to indicate missing data   Missing = ?
CodeTable   A name                This instruction gives the name of the code        CodeTable = Standard
                                  table for the protein-coding domains of the data

Keywords for Distance Data

Keywords for Format Statement

Command      Setting       Remark                                             Example

DataType     Distance      Specifies that the distance data is in the file    DataType=distance
NSeqs        A count       Number of sequences                                NSeqs=85
DataFormat   Lowerleft,    Specifies whether the data is in the lower-left    DataFormat=lowerleft
             upperright    or the upper-right triangular matrix

Examples below show the lower-left and the upper-right formats for a five-sequence dataset. Note that in each
case the distances are organized in a different order.

Lower-left matrix        Upper-right matrix

d12                      d12 d13 d14 d15
d13 d23                      d23 d24 d25
d14 d24 d34                      d34 d35
d15 d25 d35 d45                      d45

SEQUENCE INPUT DATA

General Considerations
The sequence data must consist of two or more sequences of equal length. All sequences must be aligned and
you may use the in-built alignment system for this purpose. Nucleotide and amino acid sequences should be
written in IUPAC single-letter codes. Sequences can be written in any combination of upper- and lower-case
letters. Special symbols for alignment gaps, missing data, and identical sites also can be included in the
sequences.
Special Symbols
Blank spaces and tabs are frequently used to format data files, so they are simply ignored by MEGA. ASCII
characters such as the period (.), dash (-), and question mark (?), are generally used as special symbols to
represent identity to the first sequence, alignment gaps, and missing data, respectively.
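
The following Python sketch illustrates how the identical-site symbol can be expanded against the first sequence, ignoring the blank spaces and tabs used for layout. It is an illustration rather than MEGA's own routine. The function name expand_identical is hypothetical, and the period is assumed to be the identical-site symbol.

```python
def expand_identical(sequences, identical=".", ignore=" \t"):
    """Replace the identical-site symbol with the state from the first sequence."""
    # Blank spaces and tabs are layout only, so they are dropped first.
    cleaned = ["".join(ch for ch in seq if ch not in ignore) for seq in sequences]
    reference = cleaned[0]
    expanded = [reference]
    for seq in cleaned[1:]:
        if len(seq) != len(reference):
            raise ValueError("all sequences must have the same length")
        expanded.append(
            "".join(ref if ch == identical else ch for ref, ch in zip(reference, seq))
        )
    return expanded


rows = ["TAATTAAAGG GCCGTGGTAT", "GTG..G.... ....C....."]
full = expand_identical(rows)
print(full[1])
```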

IUPAC Single Letter Codes


Nucleotide or amino acid sequences should be written in IUPAC single-letter codes. The single-letter codes
supported in MEGA are as follows.

Symbol   Name             Remarks

DNA/RNA
A        Adenine          Purine
G        Guanine          Purine
C        Cytosine         Pyrimidine
T        Thymine          Pyrimidine
U        Uracil           Pyrimidine
R        Purine           A or G
Y        Pyrimidine       C or T/U
M                         A or C
K                         G or T
S        Strong           C or G
W        Weak             A or T
H        Not G            A or C or T
B        Not A            C or G or T
V        Not U/T          A or C or G
D        Not C            A or G or T
N        Ambiguous        A or C or G or T

Protein
A        Alanine          Ala
C        Cysteine         Cys
D        Aspartic Acid    Asp
E        Glutamic Acid    Glu
F        Phenylalanine    Phe
G        Glycine          Gly
H        Histidine        His
I        Isoleucine       Ile
K        Lysine           Lys
L        Leucine          Leu
M        Methionine       Met
N        Asparagine       Asn
P        Proline          Pro
Q        Glutamine        Gln
R        Arginine         Arg
S        Serine           Ser
T        Threonine        Thr
V        Valine           Val
W        Tryptophan       Trp
Y        Tyrosine         Tyr
*        Termination      *


Defining Gene and Domains

Writing Command Statement for Defining Genes and Domains


Keywords for Command Statements (Genes/Domains)

Command      Setting              Remark                                             Example

Domain       A name               This instruction defines a domain with the         Domain=first_exon
                                  given name
Gene         A name               This instruction defines a gene with the given     Gene=cytb
                                  name
Property     Exon, Intron,        This instruction specifies the protein-coding      Property=cytb
             Coding, Noncoding,   attribute for a domain. The keywords Exon and
             and End              Coding are synonymous; similarly, Intron and
                                  Noncoding are synonymous. End specifies that
                                  the domain with the given name has ended.
CodonStart   A number             This instruction specifies the site where the
                                  next 1st-codon position will be found in a
                                  protein-coding domain.

DEFINING GROUPS AND META DATA

Writing Command Statements for Defining Groups of Taxa and for Annotating Taxa
with Meta Data
The MEGA format allows you to assign group definitions and other meta data to the taxa in sequence alignment
files as well as to distance data files. Meta data is written in a set of curly brackets following the taxa name. The
meta data can be attached to the taxa name using an underscore or it can just be appended to the sequence name.
It is important to note that there should be no spaces between the taxa name and meta data command. (Note that
groups of taxa can also be defined interactively through a dialog box). MEGA supports the following meta data
commands (order does not matter):
group, species, population, continent, country, city, year, month, day, time
Meta data commands must adhere to the following rules:
• All fields are optional.
• Fields are defined after taxa names and in curly braces that follow an underscore (_) character.
• Fields are defined as field=value pairs.
• Fields definitions are separated by the pipe (|) character.
• Year and day fields must be integers.
• Month can be defined as an integer (1-12), the full month name (e.g. September) or a 3 letter
abbreviation (e.g. sep).
• Values for string-based fields (population, group, species, continent, country, city) must follow
the same rules as taxa names.
• Time must be formatted as hh:mm:ss.

The following example shows meta data commands for three pathogen sequences:
#pathogen_sample_20200520_Paris_{population=european|group=symptomatic|species=homo_sapiens|continent=Europe|country=France|city=Paris|year=2020|month=5|day=20|time=23:59:59}
TAATTAAAGG GCCGTGGTAT A-CTGACCAT GCGAAGGTAG CATAATCATT AGCCTTTTGA TTTGAGGCTG
#pathogen_sample_20200610_Canberra_{population=european|group=asymptomatic|species=homo_sapiens|continent=Australia|country=Australia|city=Canberra|year=2020|month=6|day=10|time=13:59:59}
GTG..G.... ....C..... TTT.....G. .......... .......... ..T.....A. ..GA.....C
#pathogen_sample_20180402_Sydney_{population=european|group=asymptomatic|species=felis_catus|continent=Australia|country=Australia|city=Sydney|year=2018|month=4|day=2|time=22:58:00}
AT...G.... ....C..... TT......G. .......... .......... ..T.....A. ..G......C
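
The meta data rules listed above can be checked with a short Python sketch such as the one below. It is an illustration only and does not reproduce MEGA's parser. The names parse_month and parse_taxon_label are hypothetical, the function expects just the label (the text beginning with the '#' sign, without any sequence data), and the bare group form such as {Mammal} is deliberately not handled.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
STRING_FIELDS = {"group", "species", "population", "continent", "country", "city"}


def parse_month(value):
    """Accept 1-12, a full month name, or a 3-letter abbreviation."""
    text = value.strip().lower()
    if text.isdigit():
        month = int(text)
        if not 1 <= month <= 12:
            raise ValueError(f"month out of range: {value!r}")
        return month
    for number, name in enumerate(MONTH_NAMES, start=1):
        if text == name or text == name[:3]:
            return number
    raise ValueError(f"unrecognised month: {value!r}")


def parse_taxon_label(label):
    """Split '#name_{field=value|...}' into (name, metadata)."""
    match = re.fullmatch(r"#(.+?)_?\{([^{}]*)\}", label.strip())
    if match is None:
        # No meta data block: the whole label is the taxon name.
        return label.strip().lstrip("#"), {}
    name, body = match.groups()
    metadata = {}
    for item in body.split("|"):
        field, separator, value = item.partition("=")
        field = field.strip().lower()
        if not separator:
            raise ValueError(f"expected field=value, got {item!r}")
        if field in STRING_FIELDS:
            metadata[field] = value
        elif field in ("year", "day"):
            metadata[field] = int(value)  # must be integers
        elif field == "month":
            metadata[field] = parse_month(value)
        elif field == "time":
            if re.fullmatch(r"\d{2}:\d{2}:\d{2}", value) is None:
                raise ValueError(f"time must be hh:mm:ss, got {value!r}")
            metadata[field] = value
        else:
            raise ValueError(f"unknown meta data field: {field!r}")
    return name, metadata


label = ("#pathogen_sample_20200520_Paris_{population=european|group=symptomatic|"
         "species=homo_sapiens|continent=Europe|country=France|city=Paris|"
         "year=2020|month=5|day=20|time=23:59:59}")
name, meta = parse_taxon_label(label)
print(name, meta["city"], meta["year"], meta["month"], meta["day"])
```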

In the following, we show an example in which human and mouse are designated as the members of the
group Mammal and chicken belongs to group Aves.
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT

!Gene=SecondGene Domain=Intron Property=Noncoding;


#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT

!Gene=ThirdGene Domain=Exon2 Property=Coding;


#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA

Using the Visual Tool


Data | Select Taxa & Groups

This invokes the Select Taxa & Groups dialog box for including or excluding taxa, defining groups of taxa, and
editing names of taxa and groups.

Defining Species and Populations

Beginning with MEGA7, it is possible to write command statements in .meg files for grouping taxa into species
and populations as well as groups. The syntax for adding species, group, or population specification for a taxon
is:
_{group=groupName|species=speciesName|population=populationName}
Examples:
#human_hemoglobin_subunit_alpha_{species=homo_sapiens}
#gag_specimen1_{species=homo_sapiens|population=european}
#gorilla_HBA1_{group=primates}

See also Writing Command Statements for Defining Groups of Taxa

LABELLING INDIVIDUAL SITES

What is a Site Label?


The individual sites in nucleotide or amino acid data can be labeled to construct non-contiguous sets of sites.
The Setup Genes and Domains dialog can be used to assign or edit site labels, in addition to specifying them in the
input data files. This is shown in the following example of three sequences in which the sites in the Third Gene are
labeled with a ‘+’ mark. An underscore marks the absence of a label.
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT

!Gene=SecondGene Domain=AnIntron Property=Noncoding;


#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT

!Gene=ThirdGene Domain=Exon2 Property=Coding;


#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
!Label +++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++;

Each site can be associated with only one label. A label can be a letter or a number.
For analyses that require codons, MEGA includes only those codons in which all three positions are given the
same label. This site labeling system facilitates the analysis of specific sites, as is often required for comparing
sequences of regulatory elements, intron-splice sites, and antigen recognition sites in genes such as those of the
Major Histocompatibility Complex.
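
The codon rule can be illustrated with the human sequence and the label line of the Third Gene from the example above. The Python sketch below is an assumption-laden illustration, not MEGA's implementation: the function name codons_with_label is hypothetical, and the reading frame is assumed to start at the first site.

```python
def codons_with_label(sequence, labels, wanted):
    """Return the codons whose three sites all carry the same wanted label."""
    if len(sequence) != len(labels):
        raise ValueError("one label is needed for every site")
    selected = []
    for start in range(0, len(sequence) - 2, 3):
        codon_labels = labels[start:start + 3]
        # All three positions must share one label, and it must be wanted.
        if codon_labels[0] in wanted and codon_labels == codon_labels[0] * 3:
            selected.append(sequence[start:start + 3])
    return selected


human = "ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA"
label_line = "+++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++"
plus_codons = codons_with_label(human, label_line, wanted={"+"})
print(plus_codons)
```

Codons such as the one labeled 123 are skipped because their three positions carry different labels.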

How to Label Sites

Sites in a sequence alignment can be categorized and labeled with user-defined symbols. Each category is
represented by a letter or a number. Each site can be assigned to only one category, although any combination of
categories can be selected for analysis.
Labeled sites work independently of and in addition to genes and domains, thus allowing complex subsets of sites to
be defined easily.
Sites can be labelled in one of two ways. First, the Genes and Domains dialog (see below) has a tab named Site
Labels which provides manual site-by-site labelling as well as automatic labelling based on site attributes (variable
sites, parsimony informative sites, etc...).

Second, sites can be labelled in the MEGA sequence alignment format files following the format/example described
here.
DISTANCE INPUT DATA

General Considerations (Distance Data Format)


For a set of m sequences (or taxa), there are m(m-1)/2 pairwise distances. These distances can be arranged either
in the lower-left or in the upper-right triangular matrix. After writing the #mega, !Title, !Description,
and !Format commands (some of which are optional), you then need to write all the taxa names (see
below). Taxa names are followed by the distance matrix. An example of a matrix is:

#one
#two
#three
#four
#five
1.0 2.0 3.0 4.0
3.0 2.5 4.6
1.3 3.6
4.2

In the above example, pairwise distances are written in the upper triangular matrix (upper-right format). Two
alternate distance matrix formats are:

Lower-left matrix        Upper-right matrix

d12                      d12 d13 d14 d15
d13 d23                      d23 d24 d25
d14 d24 d34                      d34 d35
d15 d25 d35 d45                      d45
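
The two layouts carry the same m(m-1)/2 distances in a different order. The Python sketch below expands either layout into a full symmetric matrix, using the five-taxon example given earlier in this section. The function name full_matrix and the list-of-rows input are assumptions made for illustration.

```python
def full_matrix(rows, n, layout="upperright"):
    """Build a symmetric n x n matrix from n-1 triangular rows of distances."""
    if len(rows) != n - 1:
        raise ValueError("a triangular matrix for n taxa has n-1 rows")
    matrix = [[0.0] * n for _ in range(n)]
    for r, row in enumerate(rows):
        if layout == "upperright":
            # Row r holds the distances from taxon r to taxa r+1 ... n-1.
            pairs = [(r, r + 1 + k) for k in range(len(row))]
            expected = n - 1 - r
        elif layout == "lowerleft":
            # Row r holds the distances from taxon r+1 to taxa 0 ... r.
            pairs = [(r + 1, k) for k in range(len(row))]
            expected = r + 1
        else:
            raise ValueError(f"unknown layout: {layout!r}")
        if len(row) != expected:
            raise ValueError(f"row {r} should contain {expected} distances")
        for (i, j), value in zip(pairs, row):
            matrix[i][j] = matrix[j][i] = float(value)
    return matrix


upper = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.5, 4.6], [1.3, 3.6], [4.2]]
lower = [[1.0], [2.0, 3.0], [3.0, 2.5, 1.3], [4.0, 4.6, 3.6, 4.2]]
from_upper = full_matrix(upper, 5, "upperright")
from_lower = full_matrix(lower, 5, "lowerleft")
print(from_upper == from_lower)
```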



TREE INPUT DATA

Display Newick Tree From File


User Tree | Display Newick Tree
Use this to retrieve and display one or more trees written in Newick format. Multiple trees can be displayed, and
their consensus built, in the Tree Explorer. MEGA supports the display of Newick format trees
containing branch lengths as well as bootstrap or other counts (note that the Newick formats do not contain the
total number of bootstrap replications conducted).

IMPORTING DATA FROM OTHER FORMATS

Importing Data
MEGA supports conversions from several different file formats into MEGA formats. Each format is indicated by
the file extension used. Supported formats include:

Extension    File type

.aln         CLUSTAL
.nexus       PAUP, MacClade
.phylip      PHYLIP Interleaved
.phylip2     PHYLIP Noninterleaved
.gcg         GCG format
.fasta       FASTA format
.pir         PIR format
.nbrf        NBRF format
.msf         MSF format
.ig          IG format
.xml         Internet (NCBI) XML format

The following sections briefly describe each of these formats and how MEGA handles their conversion.

COMMON FILE CONVERSION ATTRIBUTES

The default input formats are determined by a file’s extension (e.g., a file with the extension of “.ig” is initially
assumed to be in “IG” input format). However, you have the option to specify any format for any file; the file
extension is simply used as an initial guide. Note that the specification of an incorrect file format most often
results in an erroneous conversion or other unexpected error.
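
The extension-based default can be pictured as a simple lookup that the user may override. The Python sketch below is illustrative only. The dictionary mirrors the table of supported formats (with .aln taken as the CLUSTAL extension), and the names DEFAULT_FORMATS and guess_format are hypothetical.

```python
import os

# Default input format suggested by each file extension.
DEFAULT_FORMATS = {
    ".aln": "CLUSTAL",
    ".nexus": "PAUP, MacClade",
    ".phylip": "PHYLIP Interleaved",
    ".phylip2": "PHYLIP Noninterleaved",
    ".gcg": "GCG format",
    ".fasta": "FASTA format",
    ".pir": "PIR format",
    ".nbrf": "NBRF format",
    ".msf": "MSF format",
    ".ig": "IG format",
    ".xml": "Internet (NCBI) XML format",
}


def guess_format(filename, override=None):
    """Return the user's choice if given, else the default for the extension."""
    if override is not None:
        return override
    extension = os.path.splitext(filename)[1].lower()
    # None means that the extension gives no hint and the user must choose.
    return DEFAULT_FORMATS.get(extension)


print(guess_format("Bigrab2.aln"))
print(guess_format("data.txt", override="FASTA format"))
```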

Input file types can include any of the following characters in their sequence data:
The letters: a-z,A-Z for DNA and protein sequences
Period (.)
Hyphen (-)
The space character
Question mark (?).

Depending on their context, all other characters encountered in input files are either ignored or are interpreted as
specific non-sequence data, such as comments, headers, etc.

The first line of all converted files is always: #Mega


The second line of all converted files is always: !Title: <filename>
where <filename> is the name of the input file.
The third line of all converted files is blank.

Many formats can specify the length of the sequences contained within them. The MEGA conversion utility
ignores these data and does not check to see if the sequences are as long as they are purported to be.

Convert to MEGA Format


File | Convert File Format to MEGA

This item allows you to choose the file and/or the format that you would like to use to convert a given sequence
data file into a MEGA format. It converts the data file and displays the converted data in the editor.
Files written in a number of popular data formats can be converted into MEGA format. MEGA supports conversion
of CLUSTAL, NEXUS (PAUP, MacClade), PHYLIP, GCG, FASTA, PIR, NBRF, MSF, IG, and XML formats.
Details about how MEGA reads and converts these file formats are given in the section Importing Data from Other
Formats.

FORMAT SPECIFIC NOTES

Converting Clustal Format


The sequence alignment outputs from CLUSTAL software often are given the default extension .ALN.
CLUSTAL is an interleaved format. In a page-wide arrangement the sequence name is in the first column and a
part of the sequence’s data is right justified. An example of the CLUSTAL format follows:

CLUSTAL X (1.8) multiple sequence alignment

Q9Y2J0_Has ------------MTDTVFSNSSNRWMYPSDRPLQSNDKEQLQAGWSVHPG
Q06846_RP3A_BOVIN ------------MTDTVFSSSSSRWMCPSDRPLQSNDKEQLQTGWSVHPS
JX0338_rabphilin-3A-mouse ------------MTDTVVN----RWMYPGDGPLQSNDKEQLQAGWSVHPG

Q9Y2J0_Has GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
Q06846_RP3A_BOVIN GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
JX0338_rabphilin-3A-mouse AQTDRQRKQEELTDEEKEIINRVIARAEKMEAMEQER--IGRLVDRLETM

The CLUSTAL file above would be converted by MEGA into the following format:

#mega
Title: Bigrab2.aln

#Q9Y2J0_Has
------------MTDTVFSNSSNRWMYPSDRPLQSNDKEQLQAGWSVHPG
GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVET-NNRLH

#Q06846_RP3A_BOVIN
------------MTDTVFSSSSSRWMCPSDRPLQSNDKEQLQTGWSVHPS
GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVETSNNRPH

#JX0338_rabphilin-3A-mouse
------------MTDTVVN----RWMYPGDGPLQSNDKEQLQAGWSVHPG
AQTDRQRKQEELTDEEKEIINRVIARAEKMEAMEQER--IGRLVDRLETM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVETSNNRPH
Converting FASTA Format
The FASTA file format is very simple and is quite similar to the MEGA file format. The following is a
sample input file:

>G019uabh 400 bp
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah 268 bp
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT

The MEGA file converter looks for a line that begins with a greater-than sign (‘>’), replaces it with a pound sign
(‘#’), takes the word following the pound sign as the sequence name, deletes the rest of the line, and takes the
following lines (up to the next line beginning with a ‘>’) as the sequence data. The FASTA file above would
convert as follows:

#mega
Title: infile.fasta

#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT
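
The FASTA rule described above is simple enough to sketch in a few lines of Python. This is an illustration of the described behaviour, not MEGA's implementation; the function name fasta_to_mega is hypothetical, and the header lines follow the common conversion attributes given earlier (#Mega, !Title:, blank line).

```python
def fasta_to_mega(text, filename):
    """Convert FASTA text to MEGA-format text following the rule above."""
    out = ["#Mega", f"!Title: {filename}", ""]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            # '>' becomes '#'; keep only the first word as the sequence name.
            words = line[1:].split()
            out.append("#" + (words[0] if words else ""))
        else:
            out.append(line)  # sequence data is copied through unchanged
    return "\n".join(out) + "\n"
```

Everything after the first word of a '>' line (for example "400 bp") is discarded, which matches the conversion shown above.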

Converting GCG Format


These files consist of one or more groups of non-blank lines separated by one or more blank lines; the non-blank
lines look similar to this:

Chloroflex
Chloroflex Length: 428 Mon Sep 25 17:34:20 MDT 2000 Check: 0 ..
1 MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
51 NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
101 DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
151 VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
201 VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
251 REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
301 KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
351 PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
401 .......... .......... ........

The “Check” tag near the end of a line signifies the first line in a new sequence expression. The name of the
sequence is obtained from the preceding line; the following lines, up to the next blank line, are accepted as the
sequence. For each line in the sequence, the leading digits are stripped off, and the rest of the line is used. The
following shows a conversion of the above sequence.

#mega
Title: infile.gcg

#Chloroflex
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
.......... .......... ........
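
A minimal Python sketch of the GCG rule described above follows. It is an illustration only: the function name gcg_to_mega is hypothetical, and detecting the tag by searching for the text "Check:" anywhere in the line is an assumption about how the tag is recognized.

```python
import re


def gcg_to_mega(text, filename):
    """Convert GCG-style text to MEGA-format text following the rule above."""
    out = ["#Mega", f"!Title: {filename}", ""]
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if i > 0 and "Check:" in lines[i]:
            # The sequence name is taken from the line before the "Check" line.
            words = lines[i - 1].split()
            out.append("#" + (words[0] if words else ""))
            i += 1
            # Sequence lines run up to the next blank line.
            while i < len(lines) and lines[i].strip():
                # Strip the leading position number, keep the rest.
                out.append(re.sub(r"^\s*\d+\s*", "", lines[i]).rstrip())
                i += 1
        else:
            i += 1
    return "\n".join(out) + "\n"
```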

Converting IG Format
These files consist of one or more groups of non-blank lines separated by one or more blank lines. The following
is an example of the non-blank lines:

;G028uaah 240 bases
G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT

The first line in each group begins with a semicolon. This line is ignored by MEGA. The following line
(e.g., G028uaah above) is treated as the name of the sequence. Subsequent lines, until the next semicolon, are
taken as the sequence. MEGA recognizes the letters a-z and A-Z for DNA and protein sequences and only a few
special characters, such as period [.], hyphen [-], space, and question mark [?]. Depending on their context, all
other characters in the input files are either ignored or are interpreted as specific non-sequence data, such as
comments, headers, etc.

The example converts to MEGA file format as follows:

#mega
!Title: filename
#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT

Converting MSF Format


These files consist of one or more groups of non-blank lines separated by one or more blank lines. The following
is an example of the non-blank lines:

;G028uaah 240 bases
G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT

The first line in each group begins with a semicolon. This line is ignored by MEGA. The following line
(e.g., G028uaah above) is treated as the name of the sequence. Subsequent lines, until the next semicolon, are
taken as the sequence. MEGA recognizes the letters a-z and A-Z for DNA and protein sequences and only a few
special characters, such as period [.], hyphen [-], space, and question mark [?]. Depending on their context, all
other characters in the input files are either ignored or are interpreted as specific non-sequence data, such as
comments, headers, etc.

The example converts to MEGA file format as follows:

#mega
!Title: filename
#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT

Converting NBRF Format


NBRF files consist of one or more groups of non-blank lines separated by one or more blank lines; the non-blank
lines look similar to this:

>P1;Chloroflex
Chloroflex 428 bases
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA-ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD---- ---DAVRIER AVEEVLVP-- ---------- ----------
---------- ---------- --------*

Each group begins with a line starting with a greater-than symbol (‘>’). This line is ignored. The first word in the
following line (e.g., Chloroflex above) is treated as the name of the sequence; the rest of that line is ignored.
Subsequent lines are taken as the sequence. This example would be converted to the MEGA file format as
follows:

#mega
!Title: filename

#Chloroflex
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA-ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD---- ---DAVRIER AVEEVLVP-- ---------- ----------
---------- ---------- --------

Converting NEXUS Format


The NEXUS file format has a header with lines identifying the name of each of the sequences in the file,
followed by lines that begin with the sequence name and some data. An example of part of an input file is:

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=17 NCHAR=428;
FORMAT DATATYPE=PROTEIN INTERLEAVE MISSING=-;
[Name: Chloroflex Len: 428 Check: 0]
[Name: Rcapsulatu Len: 428 Check: 0]
MATRIX
Chloroflex MSKEHVQTIATDDVSKNGHT PPTNASTPPYPFVAIVGQAE
Rcapsulatu ---------MTTAVARLQPS ASGAKTRPVFPFSAIVGQED

Chloroflex DQCRALEQQSGKTKKPAVIN IPVPVVDLPLGATEDRVCGT
Rcapsulatu DWATVLS-----TN---VIR KPTPVVDLPLGVSEDRVVGA

The MEGA conversion function looks for all the lines starting with the “[Name:” flag and takes the following
word as a sequence name. The conversion function then scans through the data looking for all lines starting with
each of the identified names and places them on the output. This appears as follows:

#mega
Title: infile.nexus
#Chloroflex
MSKEHVQTIATDDVSKNGHT PPTNASTPPYPFVAIVGQAE
DQCRALEQQSGKTKKPAVIN IPVPVVDLPLGATEDRVCGT

#Rcapsulatu
---------MTTAVARLQPS ASGAKTRPVFPFSAIVGQED
DWATVLS-----TN---VIR KPTPVVDLPLGVSEDRVVGA
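
The two-pass NEXUS procedure described above (collect the names, then gather the lines that start with each name) can be sketched as follows. This is an illustration and not MEGA's code; the function name nexus_to_mega is hypothetical, and names are assumed to contain no spaces.

```python
def nexus_to_mega(text, filename):
    """Convert interleaved NEXUS text to MEGA-format text following the rule above."""
    lines = [line.strip() for line in text.splitlines()]
    # Pass 1: the word after each "[Name:" flag is a sequence name.
    names = [line.split()[1] for line in lines
             if line.startswith("[Name:") and len(line.split()) > 1]
    out = ["#Mega", f"!Title: {filename}", ""]
    # Pass 2: for each name, collect every data line that starts with it.
    for name in names:
        out.append("#" + name)
        for line in lines:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == name:
                out.append(parts[1])
        out.append("")  # blank line between sequences
    return "\n".join(out)
```

Because the whole file is scanned once per name, the interleaved blocks of each sequence end up together in the output.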

Converting Phylip (Interleaved) Format


The PHYLIP format is interleaved, similar to the MSF format. It consists of a line of numeric data, which is
ignored by MEGA, followed by a group of one or more lines of text. Each line in the first group begins with a
sequence name in the first column, followed by the initial part of that sequence; the group is terminated by a blank
line. Each subsequent group contains the same number of lines as the first group, in the same order. Each of these
lines is a continuation of the corresponding sequence. The following might be observed at the
beginning of a PHYLIP data file:

2 2000 I
G019uabh ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
G028uaah CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT

GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT

TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC

MEGA would convert this data as follows:

#mega
Title: cap-data.phylip

#G019uabh
ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
#G028uaah
CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
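
De-interleaving works by pairing the n-th line of every later group with the n-th sequence named in the first group. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, names are assumed to be separated from the data by whitespace, and the input is assumed to be well formed (every first-group line holds a name and some data).

```python
def phylip_interleaved_to_mega(text, filename):
    """Convert interleaved PHYLIP text to MEGA-format text."""
    out = ["#Mega", f"!Title: {filename}", ""]
    blocks, current = [], []
    for line in text.splitlines()[1:]:  # skip the numeric header line
        if line.strip():
            current.append(line.strip())
        elif current:  # a blank line closes the current group
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    if not blocks:
        return "\n".join(out) + "\n"
    names, rows = [], []
    for entry in blocks[0]:  # first group: name followed by data
        name, data = entry.split(None, 1)
        names.append(name)
        rows.append([data])
    for block in blocks[1:]:  # later groups: data only, same order
        for sequence_rows, data in zip(rows, block):
            sequence_rows.append(data)
    for name, sequence_rows in zip(names, rows):
        out.append("#" + name)
        out.extend(sequence_rows)
    return "\n".join(out) + "\n"
```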

Converting Phylip (Non-interleaved) Format


While otherwise similar to the PHYLIP interleaved format, this format is not interleaved. For example:

00I
G019uabh ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
GACTTCATTC TAGTCATTAT AGCTGCTGGC AGTATAACTG GCCAGCCTTT
AATACATTGC TGCTTAGAGT CAAAGCATGT ACTTAGAGTT

G028uaah CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
TGGCAGTATA ACTGGCCAGC CTTTAATACA TTGCTGCTTA GAGTCAAAGC
ATGTACTTAG AGTTGGTATG ATTTATCTTT TTGGTCTTCT

This file would be converted to MEGA format as follows:

#mega
Title: infile.phylip2

#G019uabh
ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
GACTTCATTC TAGTCATTAT AGCTGCTGGC AGTATAACTG GCCAGCCTTT
AATACATTGC TGCTTAGAGT CAAAGCATGT ACTTAGAGTT

#G028uaah
CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
TGGCAGTATA ACTGGCCAGC CTTTAATACA TTGCTGCTTA GAGTCAAAGC
ATGTACTTAG AGTTGGTATG ATTTATCTTT TTGGTCTTCT

Converting PIR Format


These files consist of groups of non-blank lines that look similar to this:

ENTRY G006uaah
TITLE G019uabh 400 bp 240 bases
SEQUENCE
5 10 15 20 25 30
1 A C A T A A A A T A A A C T G T T T T C T A T G T G A A A A
31 T T A A C C T A N N A T A T G C T T T G C T T A T G T T T A
61 A G A T G T C A T G C T T T T T A T C A G T T G A G G A G T
91 T C A G C T T A A T A A T C C T C T A A G A T C T T A A A C
121 A A A T A G G A A A A A A A C T A A A A G T A G A A A A T G
151 G A A A T A A A A T G T C A A A G C A T T T C T A C C A C T
181 C A G A A T T G A T C T T A T A A C A T G A A A T G C T T T
211 T T A A A A G A A A A T A T T A A A G T T A A A C T C C C C

The MEGA format converter looks for the “ENTRY” tag and treats the following string as the sequence name,
e.g., G006uaah above. The remaining lines have their digits and spaces removed; any non-sequence characters
also are deleted. MEGA would convert the above sequence as follows:

#mega
Title: filename.pir

#G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAA
TTAACCTANNATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGT
TCAGCTTAATAATCCTCTAAGATCTTAAAC
AAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACT
CAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCC
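
The PIR rule described above can be sketched in Python as follows. This is an illustration, not MEGA's code. The function name pir_to_mega is hypothetical, and two details are assumptions: sequence data is taken to start after the SEQUENCE line, and the characters kept are the ones listed under Common File Conversion Attributes (letters, period, hyphen, question mark).

```python
import re


def pir_to_mega(text, filename):
    """Convert PIR-style text to MEGA-format text following the rule above."""
    out = ["#Mega", f"!Title: {filename}", ""]
    in_sequence = False
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "ENTRY":
            # The string after the ENTRY tag is the sequence name.
            out.append("#" + (words[1] if len(words) > 1 else ""))
            in_sequence = False
        elif words[0] == "SEQUENCE":
            in_sequence = True  # assumption: data follows this line
        elif in_sequence:
            # Drop digits, spaces, and any other non-sequence characters.
            residues = re.sub(r"[^A-Za-z.\-?]", "", line)
            if residues:  # the numeric ruler line becomes empty and is skipped
                out.append(residues)
    return "\n".join(out) + "\n"
```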

Converting XML Format


These files consist of a group of XML tags and attribute values. A DOCTYPE header may or may not be
present. The MEGA input converter for XML file formats does not implement a full parser; it only looks for a
few specific tags that might be present. For example, an XML file might contain the following data:

<Bioseq-set>
<Bioseq>
<name>G019uabh</name>
<length>240</length>
<mol>DNA</mol>
<cksum>302C447C</cksum>
<seq-data>ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATT
AAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTGATTGATTGAT
TGATTGATGGTT
TACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACATTG
CTGCTTAGAGT
CAAAGCATGTACTTAGAGTT</seq-data>
</Bioseq>
</Bioseq-set>

The MEGA format converter looks for the following two tags:

<name>G019uabh</name>
<seq-data>ATACATCATAACACTAC. . .</seq-data>

If it finds these tags, it uses the text between the <name>. . .</name> tags as the sequence name, and the text
between the <seq-data>. . .</seq-data> tags as the sequence data corresponding to that name. The conversion of
the above XML block into MEGA format would look like this:

#Mega
Title: filename.xml

#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATT
AAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTGATTGATTGAT
TGATTGATGGTT
TACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACATTG
CTGCTTAGAGT
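
Because the converter looks only for two specific tags rather than parsing the XML fully, its behaviour can be approximated with two pattern searches. The Python sketch below is an illustration, not MEGA's code; the function name xml_to_mega is hypothetical, and pairing the names with the sequence blocks in order of appearance is an assumption.

```python
import re


def xml_to_mega(text, filename):
    """Convert NCBI-style XML text to MEGA-format text using two tag searches."""
    out = ["#Mega", f"!Title: {filename}", ""]
    names = re.findall(r"<name>(.*?)</name>", text, re.S)
    sequences = re.findall(r"<seq-data>(.*?)</seq-data>", text, re.S)
    # Assumption: the n-th <name> belongs to the n-th <seq-data>.
    for name, sequence in zip(names, sequences):
        out.append("#" + name.strip())
        # Keep the original line breaks of the sequence data.
        out.extend(part.strip() for part in sequence.splitlines() if part.strip())
    return "\n".join(out) + "\n"
```

All other tags, such as <length> and <cksum>, are simply never looked at, which is consistent with the converter ignoring stated sequence lengths.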

GENETIC CODE TABLES

Built-in Genetic Codes


MEGA contains 23 built-in genetic code tables: four commonly used ones, (1) Standard, (2) Vertebrate mitochondrial,
(3) Drosophila mitochondrial, and (4) Yeast mitochondrial, plus 19 others. They can be used as templates to
create additional genetic code tables using the Genetic Code Selector. The genetic codes for the four commonly used
tables, in one-letter code, are given below.

         Code Table              Code Table
Codon    1  2  3  4     Codon    1  2  3  4
UUU      F  F  F  F     AUU      I  I  I  I
UUC      F  F  F  F     AUC      I  I  I  I
UUA      L  L  L  L     AUA      I  M  M  I
UUG      L  L  L  L     AUG      M  M  M  M
UCU      S  S  S  S     ACU      T  T  T  T
UCC      S  S  S  S     ACC      T  T  T  T
UCA      S  S  S  S     ACA      T  T  T  T
UCG      S  S  S  S     ACG      T  T  T  T
UAU      Y  Y  Y  Y     AAU      N  N  N  N
UAC      Y  Y  Y  Y     AAC      N  N  N  N
UAA      *  *  *  *     AAA      K  K  K  K
UAG      *  *  *  *     AAG      K  K  K  K
UGU      C  C  C  C     AGU      S  S  S  S
UGC      C  C  C  C     AGC      S  S  S  S
UGA      *  W  W  W     AGA      R  *  S  R
UGG      W  W  W  W     AGG      R  *  S  R
CUU      L  L  L  T     GUU      V  V  V  V
CUC      L  L  L  T     GUC      V  V  V  V
CUA      L  L  L  T     GUA      V  V  V  V
CUG      L  L  L  T     GUG      V  V  V  V
CCU      P  P  P  P     GCU      A  A  A  A
CCC      P  P  P  P     GCC      A  A  A  A
CCA      P  P  P  P     GCA      A  A  A  A
CCG      P  P  P  P     GCG      A  A  A  A
CAU      H  H  H  H     GAU      D  D  D  D
CAC      H  H  H  H     GAC      D  D  D  D
CAA      Q  Q  Q  Q     GAA      E  E  E  E
CAG      Q  Q  Q  Q     GAG      E  E  E  E
CGU      R  R  R  R     GGU      G  G  G  G
CGC      R  R  R  R     GGC      G  G  G  G
CGA      R  R  R  R     GGA      G  G  G  G
CGG      R  R  R  R     GGG      G  G  G  G
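
The four tables differ in only a handful of codons, so each can be expressed as the Standard table plus a short list of changes. The Python sketch below illustrates this using the values exactly as printed in the table above; it is not MEGA's internal representation, and the names code_table and translate are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard code (table 1), one letter per codon in the order UUU, UUC, UUA, UUG, UCU, ...
STANDARD = dict(zip(
    ("".join(codon) for codon in product(BASES, repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

# Differences from table 1, read off the table above.
DIFFERENCES = {
    1: {},
    2: {"AUA": "M", "UGA": "W", "AGA": "*", "AGG": "*"},
    3: {"AUA": "M", "UGA": "W", "AGA": "S", "AGG": "S"},
    4: {"UGA": "W", "CUU": "T", "CUC": "T", "CUA": "T", "CUG": "T"},
}


def code_table(number):
    """Return the codon-to-amino-acid mapping for table 1, 2, 3, or 4."""
    table = dict(STANDARD)
    table.update(DIFFERENCES[number])
    return table


def translate(rna, number=1):
    """Translate an RNA string codon by codon; a trailing partial codon is dropped."""
    table = code_table(number)
    usable = len(rna) - len(rna) % 3
    return "".join(table[rna[i:i + 3]] for i in range(0, usable, 3))
```

For example, the same four codons AUG AUA UGA AGA translate differently under tables 1, 2, and 3, which is why the correct table must be selected before translating mitochondrial sequences.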

Adding/Modifying Genetic Code Tables


You may add new genetic code tables and/or edit existing code tables in the Genetic Code Selector. All changes made
will be remembered by MEGA for all future analyses.

Genetic Code Statistical Attributes


There is a significant amount of redundancy in the genetic code because most amino acids are encoded by multiple
codons. It is therefore useful to know the degeneracy of each codon position in every codon. In MEGA this
information can be computed for any code table in the Genetic Code Selector. In addition to
the degeneracy of codon positions, MEGA writes the number of synonymous sites and the number of nonsynonymous
sites for each codon using the Nei and Gojobori (1986) method. An example of the results obtained for
the standard genetic code is given below.

Code Table: Standard
Method: Nei-Gojobori (1986) methodology
S = No. of synonymous sites
N = No. of nonsynonymous sites

Codon      No. of sites      Redundancy
           S       N         1st  2nd  3rd
UUU (F)    0.333   2.667     0    0    2
UUC (F)    0.333   2.667     0    0    2
UUA (L)    0.667   2.333     2    0    2
UUG (L)    0.667   2.333     2    0    2
UCU (S)    1       2         0    0    4
UCC (S)    1       2         0    0    4
UCA (S)    1       2         0    0    4
UCG (S)    1       2         0    0    4
UAU (Y)    1       2         0    0    2
UAC (Y)    1       2         0    0    2
UAA (*)    0       3         0    0    0
UAG (*)    0       3         0    0    0
UGU (C)    0.5     2.5       0    0    2
UGC (C)    0.5     2.5       0    0    2
UGA (*)    0       3         0    0    0
UGG (W)    0       3         0    0    0
CUU (L)    1       2         0    0    4
CUC (L)    1       2         0    0    4
CUA (L)    1.333   1.667     2    0    4
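
The S and N columns above can be reproduced with a short calculation. The Python sketch below is an illustration rather than MEGA's code: for each codon position it takes the fraction of single-base changes that leave the amino acid unchanged, and sums the three fractions to give S (N is 3 minus S). Two details are assumptions inferred from the printed values: changes that produce a stop codon are left out of the count, and stop codons themselves are reported as S = 0, N = 3.

```python
from itertools import product

BASES = "UCAG"
# Standard code, one letter per codon in the order UUU, UUC, UUA, UUG, UCU, ...
STANDARD = dict(zip(
    ("".join(codon) for codon in product(BASES, repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))


def syn_nonsyn_sites(codon, code=STANDARD):
    """Return (S, N) for one codon under the assumptions stated above."""
    if code[codon] == "*":
        return 0.0, 3.0
    s = 0.0
    for position in range(3):
        synonymous = counted = 0
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if code[mutant] == "*":
                continue  # assumption: changes to a stop codon are not counted
            counted += 1
            if code[mutant] == code[codon]:
                synonymous += 1
        if counted:
            s += synonymous / counted
    return s, 3.0 - s
```

For UGU (C), for instance, the third position has two countable changes (UGC and UGG, since UGA is a stop codon), one of which is synonymous, giving S = 0.5 as in the table.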

Select Genetic Code Tables


Data | Select Genetic Code Table
Use the Select Genetic Code Table dialog from the Data menu to select the genetic code used by the protein-coding
nucleotide sequence data. This also allows you to add genetic codes to the list, edit existing codes, and compute a
few simple statistical properties of the chosen genetic code. This option becomes visible when you open a data set
containing nucleotide sequences.

Code Table Editor

The Code Table Editor allows you to create new genetic codes and to edit existing genetic codes. It contains the code
of the genetic code table highlighted in the previous window. To name a new genetic code or to rename an
existing code, click in the 'Name' box and type the new name.
The genetic code in this editor is laid out as a grid. To save space, only the amino acid encoded by a codon is shown.
The first position of the codon is shown on the left, the second position on the top, and the third position on the
right. To find the codon for any given entry on the screen, position your mouse over the desired amino acid and wait
for a moment; a yellow hint will be displayed.

To change the amino acid encoded by any codon, click and scroll down to choose the desired amino acid.
Alternatively, once the codon has been selected, type in the first letter of the name of the amino acid and the program
will jump to that part of the list. To indicate a stop codon, select '***' or type *.

Once you have made all the required changes to the name and codons, click OK. Otherwise, click Cancel. We
recommend that you check the altered genetic code using the View option to make sure that the changes have been
properly interpreted by MEGA.

VIEWING AND EXPLORING INPUT DATA

Sequence Data Explorer

The Sequence Data Explorer shows the aligned sequence data. You can scroll along the alignment using the scrollbar
at the bottom right hand side of the explorer window. The Sequence Data Explorer provides a number of utilities for
exploring the statistical attributes of the data and also for selecting data subsets.

This explorer consists of a number of regions as follows:


Menu Bar
Data menu
Search menu
Display menu
Highlight menu
Statistics menu
Help: This item brings up the help file for the Sequence Data Explorer.
Tool Bar
The tool bar provides quick access to the following menu items:
General Utilities

: This brings up the Exporting Sequence Data dialog box, which contains options to control how MEGA writes the
output data. The available output formats are Text, MEGA, CSV, and Excel.

: This brings up the Exporting Sequence Data dialog box and sets the default output format to MEGA.

: This brings up the Exporting Sequence Data dialog box and sets the default output format to Excel.

: This brings up the Exporting Sequence Data dialog box and sets the default output format to CSV (Comma
separated values).

: This brings up the dialog box for setting up and selecting domains and genes.

: This brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.

: This toggle replaces the nucleotide/amino acid at a site with the identical symbol (e.g., a dot) if the site contains
the same nucleotide/amino acid as the first sequence.

: This button provides the facility to translate codons in the sequence data into amino acid sequences and
back. All protein-coding regions will be automatically identified and translated for display. When the translated
sequence is already displayed, then issuing this command displays the original nucleotide sequences (including all
coding and non-coding regions). Depending on the data displayed (translated or nucleotide), relevant menu options in
the Sequence Data Explorer become enabled. Note that the translated/un-translated status in this data explorer does
not have any impact on the options for analysis available in MEGA (e.g., Distances or Phylogeny menus),
as MEGA provides all possible options for your dataset at all times.

Highlighting Sites
C: If this button is pressed, then all constant sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
V: If this button is pressed, then all variable sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
Pi: If this button is pressed, then all parsimony-informative sites will be highlighted. A count of the highlighted sites
will be displayed on the status bar.
S: If this button is pressed, then all singleton sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
L: If this button is pressed, then all labelled sites will be highlighted and a count of highlighted sites will be displayed
on the status bar (see also labelled sites).
0: If this button is pressed, then sites will be highlighted only if they are zero-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
2: If this button is pressed, then sites will be highlighted only if they are two-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
4: If this button is pressed, then sites will be highlighted only if they are four-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
Special: This dropdown allows for the selection of a special highlighting option.
CpG/TpG/CpA: If this option is selected, then all sites at which a C is followed by a G, a T by a G, or a C by an A
will be highlighted. You may also specify the percentage of sequences that must have these properties for a site to
be counted.
Coverage: If this option is selected, you are asked to enter a percentage. All sites with this percentage or less of
ambiguous data will be highlighted.

: This button allows you to quickly navigate between highlighted sites by jumping to the previous or next
highlighted site.
Searching

: This button allows you to specify a sequence name to find. Search results are shown in bold and the row is
highlighted in blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names
that start with your search term. If no names start with the search term, MEGA looks for it anywhere
in the names (rather than just at the start).

: This button allows you to specify a Motif to search for in the sequence data. This Motif supports IUPAC codes
such as R (for A or G) and Y (for T or C). MEGA highlights (in Yellow) the first instance of this motif it finds.

Forward and backward buttons: These buttons are enabled only if you have already searched for a sequence name or
motif. Clicking the forward or backward button makes MEGA search for the next or previous result (assuming there is
more than one match).
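
The two searches described above can be sketched in Python. This is an illustration of the described behaviour, not MEGA's code. The function names are hypothetical, the name search is assumed to be case-sensitive, and the motif search is assumed to expand each IUPAC nucleotide code into the set of bases it stands for.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression character sets.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sequence_names(names, term):
    """Try an exact match, then a prefix match, then a match anywhere in the name."""
    tiers = (
        lambda name: name == term,
        lambda name: name.startswith(term),
        lambda name: term in name,
    )
    for matches in tiers:
        hits = [name for name in names if matches(name)]
        if hits:
            return hits  # stop at the first tier that finds anything
    return []


def find_motif(sequence, motif):
    """Return the 0-based index of the first motif occurrence, or -1 if absent."""
    pattern = "".join(IUPAC[symbol] for symbol in motif.upper())
    found = re.search(pattern, sequence.upper())
    return found.start() if found else -1
```

Later tiers are consulted only when the earlier ones find nothing, so an exact match is never mixed with partial matches.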
The 2-Dimensional Data Grid
Fixed Row: This is the first row in the data grid. It is used to display the nucleotides (or amino acids) in the first
sequence when you have chosen to show their identity using a special character. For protein coding regions, it also
clearly marks the first, second, and the third codon positions.
Fixed Column: This is the first and the leftmost column in the data grid. It is always visible, even when you are
scrolling through sites. The column contains the sequence names and an associated check box. You can check or
uncheck this box to include or exclude a sequence from analysis. Also in this column, you can drag-and-drop
sequences to sort them.
Rest of the Grid: Cells to the right of and below the first row contain the nucleotides or amino acids of the input
data. Note that all cells are drawn in light color if they contain data corresponding to unselected sequences or genes
or domains.
Status Bar
This section displays the location of the focused site and the total sequence length. It also shows the site label, if any,
and a count of the highlighted sites.

DATA MENU

General Description
Data Menu (Sequence Data Explorer)
This menu provides commands for working with selected data in the Sequence Data Explorer.
The commands in this menu are:
Write Data to File: Brings up the Exporting Sequence Data dialog box.
Translate/Untranslate: Translates protein-coding nucleotide sequences into protein sequences, and back to
nucleotide sequences.
Select Genetic Code Table: Brings up the Select Genetic Code dialog box, in which you can select, edit, or add a
genetic code table.
Setup/Select Genes and Domains: Brings up the Sequence Data Organizer, in which you can define and edit genes
and domains.
Setup/Select Taxa and Groups: Brings up the Setup/Select Taxa & Groups dialog, in which you can
edit taxa and define groups of taxa.
Quit Data Viewer: Takes the user back to the main interface.

Save Session (in Sequence Data Explorer)


Data | Save Session
This saves the current session you are working on so it may later be resumed. Read further about session
saving.

Export Data (Sequence Data Explorer)


Data | Export Data
The Exporting Sequence Data dialog box first displays an edit box for entering a title for the sequence data being
exported. The default name is the original name of the data set, if there was one. Below the title is a space for
entering a brief description of the data set being exported.
Next is the option for choosing the format of the exported data set; MEGA currently allows the user to export
the data in MEGA, PAUP 3.0 and PAUP 4.0 (Nexus, Interleaved in both cases), and PHYLIP 3.0 (Interleaved). The
“Writing site numbers” option controls the site numbers written at the end of each line. The three choices are to
write no numbers, to write one for each site, or to write the site number of the last site.
Other options in this dialog box include the number of sites per line, which codon position(s) to use,
whether non-coding regions should be included, and whether the output is to be interleaved. For missing or
ambiguous data and alignment gaps, there are four options: include all such data, exclude all such data, exclude
sites with missing or ambiguous data only, and exclude sites with alignment gaps only.

Translate/Untranslate (in Sequence Data Explorer)


Data | Translate/Untranslate
This command is available only if the data contain protein-coding nucleotide sequences. It automatically extracts all
protein-coding domains for translation and displays the corresponding protein sequence. If the translated sequence is
already displayed, then issuing this command displays the original nucleotide sequences, including all coding and
non-coding regions. Depending on the data displayed (translated or nucleotide), relevant menu options in the
Sequence Data Explorer are enabled. However, translated and un-translated status does not have any impact on the
analytical options available in MEGA (e.g., Distances or Phylogeny menus), as MEGA provides all possible options
for your dataset at all times.

Select Genetic Code (in Sequence Data Explorer)


Data | Select Genetic Code Table
Select Genetic Code Table can be invoked from the Data menu in Sequence Data Explorer and is also
available directly in the Data menu of the main interface.

Setup/Select Genes & Domains (Sequence Data Explorer)

Data | Setup/Select Genes & Domains
Setup/Select Genes & Domains can be invoked from the Data menu in Sequence Data Explorer and is also
available directly in the Data menu of the main interface.
Setup/Select Taxa & Groups (in Sequence Data Explorer)
Data | Setup/Select Taxa & Groups
Setup/Select Taxa & Groups can be invoked from the Data menu in Sequence Data Explorer and is
also available directly in the Data menu of the main interface.

Quit Data Viewer


Data | Quit Data Viewer
This command closes the Sequence Data Explorer and takes the user back to the main interface.

DISPLAY MENU

GENERAL DESCRIPTION

Display Menu (in Sequence Data Explorer)

This menu provides commands for adjusting the display of DNA and protein sequences in the grid.
The commands in this menu are:
Show only selected sequences: To work with only a subset of the sequences in the data set, use the check
boxes to select the sequences of interest.
Use Identical Symbol: If a site contains the same nucleotide (amino acid) as appears in the first sequence
in the list, this command replaces the nucleotide (amino acid) symbol with a dot (.). If you uncheck this
option, the Sequence Data Explorer displays the single letter code for the nucleotide (amino acid).
Color Cells: This option displays the sequences such that consecutive sites with the same nucleotide (amino
acid) have the same background color.
Select Color: This option changes the color for highlighted sites. It is Yellow by default.
Sort Sequences: The sequences in the data set can be sorted based on several options: sequence names,
group names, group and sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.
Restore input order: This option resets any changes in the order of the displayed sequences (due to sorting,
etc.) back to that in the input data file.
Show Sequence Name: The name of the sequences can be displayed or hidden by checking or unchecking
this option. If the sequences have been grouped, then unchecking this option causes only the group name to
be retained. If no groups have been made, then no name is displayed.
Show Group Name. This option can be used to display or hide group names if the taxa have been
categorized into groups.
Change Font. Brings up the Font dialog box, allowing the user to choose the type, style, size, etc. of the
font to display the sequences.

Restore Input Order

Display | Restore Input Order


Choosing this restores the order in Sequence Data Explorer to that in the input text file.

Show Only Selected Sequences


Display | Show only Selected Sequences

The check boxes in the left column of the display grid can be used to select or deselect sequences for
analysis. Subsequent use of the “Show Only Selected Sequences” option in the Display menu of Sequence
Data Explorer hides all the deselected sequences and displays only the selected ones.

Color Cells

Display | Color cells


This command colors the individual cells in the two-dimensional display grid according to the nucleotide or
amino acid each cell contains. A list of default colors, based on the biochemical properties of the residues, is
given below. In a future version, these colors will be customizable by the user.

For DNA sequences:


Symbol  Color
A       Yellow
G       Fuchsia
C       Olive
T       Green
U       Green

For amino acid sequences:


Symbol  Color     Symbol  Color
A       Yellow    M       Yellow
C       Olive     N       Green
D       Aqua      P       Blue
E       Aqua      Q       Green
F       Yellow    R       Red
G       Fuchsia   S       Green
H       Teal      T       Green
I       Yellow    V       Yellow
K       Red       W       Green
L       Yellow    Y       Lime

Use Identical Symbol

Display | Use Identical Symbol


Data that contain multiple aligned sequences may be easier to view if, when the nucleotide (amino acid) is
the same as that in the corresponding site in the first sequence, the nucleotide (amino acid) is replaced by a
dot. Choosing this option again brings back the nucleotide (amino acid) single-letter codes.
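
The dot replacement described above can be illustrated with a short Python sketch. This is not MEGA's own code; the function name use_identical_symbol is hypothetical, and the sketch assumes that every character (including gaps) is compared literally against the first sequence.

```python
def use_identical_symbol(sequences):
    """Return the alignment with residues identical to the first sequence shown as '.'.

    Assumption: the first sequence is always displayed in full and every
    character, gaps included, is compared literally.
    """
    if not sequences:
        return []
    reference = sequences[0]
    masked = [reference]
    for seq in sequences[1:]:
        masked.append("".join(
            "." if i < len(reference) and ch == reference[i] else ch
            for i, ch in enumerate(seq)
        ))
    return masked


alignment = ["ATGCCA", "ATGACA", "TTGCCA"]
for row in use_identical_symbol(alignment):
    print(row)
```

Only the sites that differ from the first sequence keep their letters, which makes variable positions easy to spot by eye.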

Show Sequence Names

Display | Show Sequence Names


This option displays the full sequence names in Sequence Data Explorer.

Show Group Names

Display | Show Group Names


This option displays the full group names in Sequence Data Explorer if the sequences have been grouped
in Select/Edit Taxa Groups.
Change Font...

Display | Change Font…


This command brings up the Change Font dialog box, which allows you to change the display font,
including font type, style and size. Options to strikeout or underline selected parts of the sequences are also
available. There is also an option for using different scripts, although the only option currently available is
“Western”. Finally, the “Sample” window displays the effects of your choices.

Sort Sequences

Display | Sort Sequences


The sequences in the data set can be sorted based on several options: sequence name, group name, group and
sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.

Sort Sequences by Group Name

Display | Sort Sequences | By Group Name


Sequences that have been grouped in Select/Edit Taxa Groups can be sorted by the alphabetical order
of group names or numerical order of group ID numbers. If the group names contain both a name and a
number, the numerical order will be nested within the alphabetical order.

Sort Sequences by Group and Sequence Names

Display | Sort Sequences | By Group and Sequence Names


Sequences that have been grouped in Select/Edit Taxa Groups can be sorted by the alphabetical order
of group names or the numerical order of group ID numbers. If the group names contain both a name and a
number, the numerical order is nested within the alphabetical order. The sequences can be further arranged
by sorting the sequence names within the group names.

Sort Sequences As per Taxa/Group Organizer

Display | Sort Sequences | As per Taxa/Group Organizer


The sequence/group order seen in Select/Edit Taxa Groups is initially the same as the order in the input text
file. However, this order can be changed by dragging-and-dropping. Choose this option if you wish to see
the data in the same order in the Sequence Data Explorer as in Select/Edit Taxa Groups.

Sort Sequences By Sequence Name

Display | Sort Sequences | By Sequence Name


The sequences are sorted by the alphabetical order of sequence names or the numerical order of sequence ID
numbers. If the sequence names contain both a name and a number, then the sorting is done with the
numerical order nested within the alphabetical order.
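
The idea of nesting numerical order within alphabetical order can be sketched in Python as a "natural" sort key. This is an illustration only: the helper natural_key and the choice to ignore letter case are assumptions, not a description of MEGA's internal sorting routine.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so that 'Human10' sorts after 'Human2'.

    re.split with a capturing group always alternates text, digits, text, ...,
    so chunks at odd positions are the numeric ones.
    Assumption: letter case is ignored.
    """
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["Human10", "human2", "Chimp1", "Human1"]
print(sorted(names, key=natural_key))

# Sorting by group name first and by sequence name within each group
# (hypothetical records of the form (sequence name, group name)).
records = [("Human2", "Primates"), ("Mouse1", "Rodents"),
           ("Human1", "Primates"), ("Rat1", "Rodents")]
by_group_then_name = sorted(
    records, key=lambda r: (natural_key(r[1]), natural_key(r[0])))
print(by_group_then_name)
```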

Highlight Menu (in Sequence Data Explorer)


This menu can be used to highlight certain types of sites. The options are constant (conserved) sites, variable
sites, parsimony-informative sites, singleton sites, and 0-fold, 2-fold and 4-fold degenerate sites.
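
As a rough illustration of the first four site categories, the Python sketch below classifies the columns of an alignment using the standard definitions: a constant site has one state; a variable site has at least two; a parsimony-informative site has at least two states that each occur at least twice; a singleton site is variable with at most one state occurring more than once. The function name classify_site and the way gaps and missing data are skipped are assumptions; MEGA's own treatment of such characters may differ.

```python
from collections import Counter


def classify_site(column, ignore="-?"):
    """Classify one alignment column.

    Returns 'constant', 'parsimony-informative' or 'singleton'
    (the last two are both variable sites), or 'undetermined'
    when no usable character is present.
    Assumption: characters listed in `ignore` are skipped.
    """
    counts = Counter(ch for ch in column.upper() if ch not in ignore)
    if not counts:
        return "undetermined"
    if len(counts) == 1:
        return "constant"
    # States seen in at least two sequences.
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if repeated >= 2 else "singleton"


alignment = ["AAGA", "AAGC", "ATGG", "ATCT"]
labels = [classify_site("".join(col)) for col in zip(*alignment)]
print(labels)
```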

Highlight Conserved Sites

Highlight | Conserved Sites


Use this command to highlight constant (conserved) sites.

Highlight Variable Sites

Highlight | Variable Sites


Use this command to highlight variable sites.

Highlight Singleton Sites

Highlight | Singleton Sites


Use this command to highlight singleton sites.

Highlight Parsimony Informative Sites

Highlight | Parsim-Info Sites


Use this command to highlight parsimony-informative sites.

Highlight 0-fold Degenerate Sites

Highlight | 0-fold Degenerate Sites


Use this command to highlight 0-fold degenerate sites.

Highlight 2-fold Degenerate Sites

Highlight | 2-fold Degenerate Sites


Use this command to highlight 2-fold degenerate sites. The command is visible only if the data consists of
nucleotide sequences.

Highlight 4-fold Degenerate Sites

Highlight | 4-fold Degenerate Sites


Use this command to highlight 4-fold degenerate sites. The command is visible only if the data consists of
nucleotide sequences.

HIGHLIGHT SPECIAL

Highlight Coverage

Highlights sites where at least a specified percentage of the nucleotides or amino acids are unambiguous.
A setting of 100% means that every element in a site must be unambiguous.
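
The coverage criterion can be sketched in a few lines of Python. The function names are hypothetical, and the set of characters treated as unambiguous (A, C, G, T, U here) is an assumption made for nucleotide data; for amino acid data a different set would be supplied.

```python
def site_coverage(column, unambiguous="ACGTU"):
    """Percentage of characters in one alignment column that are unambiguous."""
    column = column.upper()
    return 100.0 * sum(ch in unambiguous for ch in column) / len(column)


def sites_with_coverage(alignment, threshold=100.0, unambiguous="ACGTU"):
    """Zero-based indices of the sites whose coverage is at least `threshold` percent."""
    return [i for i, col in enumerate(zip(*alignment))
            if site_coverage("".join(col), unambiguous) >= threshold]


alignment = ["ACGT", "A-GN", "ACGT", "ACTT"]
print(sites_with_coverage(alignment, threshold=100.0))
print(sites_with_coverage(alignment, threshold=75.0))
```

In this example the second and fourth sites each contain one gap or ambiguous symbol among four sequences, so they pass at 75% but not at 100%.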

Highlight CpG/TpG/CpA
Highlights sites which have a C followed by a G, a T followed by a G, or a C followed by an A. This also uses
coverage; the default is 100%.

SEARCH MENU

Find Prev Name

If you have already searched for a sequence name, this will find the previous instance of the search term in
relation to the row currently selected.

Find Next Name


If you have already searched for a sequence name, this will find the next instance of the search term in
relation to the row currently selected.

Find Sequence Name

This allows you to specify a sequence name to find. Search results are shown in bold and the row is highlighted
in blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names
starting with the search term. If no names start with the search term, MEGA looks for the
search term anywhere in the names (rather than just at the start).
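
The three-step matching order described above (exact match, then prefix, then anywhere in the name) can be sketched in Python as follows. The function name find_sequence_names is hypothetical, and treating the comparison as case-insensitive is an assumption.

```python
def find_sequence_names(names, term):
    """Return the indices of matching names, trying progressively looser rules.

    Order of attempts: exact match, names starting with the term,
    names containing the term. Assumption: matching ignores case.
    """
    term_l = term.lower()
    tiers = (
        lambda n: n.lower() == term_l,            # 1. exact match
        lambda n: n.lower().startswith(term_l),   # 2. prefix match
        lambda n: term_l in n.lower(),            # 3. match anywhere
    )
    for matches in tiers:
        hits = [i for i, n in enumerate(names) if matches(n)]
        if hits:
            return hits
    return []


names = ["Homo sapiens", "Pan troglodytes", "Pan paniscus", "Gorilla gorilla"]
print(find_sequence_names(names, "Pan paniscus"))
print(find_sequence_names(names, "Pan"))
print(find_sequence_names(names, "sapiens"))
```

Stopping at the first rule that yields a hit keeps an exact match from being diluted by looser partial matches.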

Hide Name Result

If you have searched for a sequence name, this will hide that search result. When the search result is hidden,
the sequence name is not shown in bold, and the previous and next buttons are disabled.

Find Prev Motif

If you have already searched for a motif, this will find the previous instance of the search term in relation to
the row and column currently selected.

Find Next Motif

If you have already searched for a motif, this will find the next instance of the search term in relation to the
row and column currently selected.

Find Motif

You may specify a motif to search for in the sequence data. The motif may contain IUPAC codes such as R
(for A or G) and Y (for T or C). MEGA jumps to the start of the first match it finds for this motif and
highlights it in yellow.
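
One way to picture a motif search with IUPAC codes is to expand each code into a character class of a regular expression. The Python sketch below does this for the standard IUPAC nucleotide codes; the function names are hypothetical, and the sketch does not claim to reproduce how MEGA implements the search.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Turn a motif such as 'RYG' into the regex source '[AG][CT][G]'."""
    return "".join("[" + IUPAC[ch] + "]" for ch in motif.upper())


def find_motif(sequence, motif):
    """Zero-based start positions of every (possibly overlapping) match."""
    seq = sequence.upper().replace("U", "T")
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq)]


print(find_motif("ATGCACGTAG", "RYG"))
```

A motif containing a character outside the table raises a KeyError, which is a simple way to reject invalid input in this sketch.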

Hide Motif

If you have searched for a motif, this will hide all the search result(s). When the search result(s) are hidden,
the yellow highlighting will disappear, and the previous and next buttons are disabled.

STATISTICS MENU

Statistics Menu (in Sequence Data Explorer)

Various summary statistics of the sequences can be computed and displayed using this menu. The
commands are:
Nucleotide Composition
Nucleotide Pair Frequencies
Codon Usage
Amino Acid Composition
Use All Selected Sites
Use only Highlighted Sites. Sites can be selected according to various criteria (see Highlight Sites), and
analysis can be performed only on the chosen subset of sites.
Display results in Excel (XL) - Affects only outputs from the Statistics menu
Display results in Comma-Delimited (CSV) - Affects only outputs from the Statistics menu
Display results in Text Editor - Affects only outputs from the Statistics menu

Nucleotide Composition

Statistics | Nucleotide Composition


This command is visible only if the data consist of nucleotide sequences. MEGA computes the base
frequencies for each sequence as well as an overall average. These are displayed by domain in a Text
Editor window (if domains have been defined in Setup/Select Genes & Domains).
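
A minimal Python sketch of a base-composition table is given below. It is illustrative only: the function names are hypothetical, ambiguous symbols and gaps are simply skipped, and the overall average is computed here from the pooled counts of all sequences, which is an assumption about how the average is defined.

```python
from collections import Counter


def base_frequencies(sequence, bases="TCAG"):
    """Percentage of each base in one sequence (U is counted as T)."""
    counts = Counter(ch for ch in sequence.upper().replace("U", "T") if ch in bases)
    total = sum(counts.values())
    return {b: (100.0 * counts[b] / total if total else 0.0) for b in bases}


def composition_table(named_sequences):
    """Per-sequence base frequencies plus an overall row labelled 'Avg.'.

    Assumption: the overall row pools the counts from all sequences.
    """
    table = {name: base_frequencies(seq) for name, seq in named_sequences.items()}
    table["Avg."] = base_frequencies("".join(named_sequences.values()))
    return table


table = composition_table({"seq1": "AACG", "seq2": "TTGG"})
for name, freqs in table.items():
    print(name, freqs)
```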

Nucleotide Pair Frequencies

Statistics | Nucleotide Pair Frequencies


This command is visible only if the data consist of nucleotide sequences. There are two options available:
one in which the nucleotide pairs are counted bidirectionally site-by-site for the two sequences (giving
rise to 16 different nucleotide pairs, because pairs such as AG and GA are counted separately), and another
in which the pairs are counted unidirectionally (10 nucleotide pairs, because pairs such as AG and GA are
pooled). MEGA will compute the frequencies of these quantities for each sequence as well as an overall
average. They will be displayed by domain (if domains have been defined in Setup/Select Genes
& Domains).
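
The difference between the 16-pair and 10-pair options comes down to whether the order of the two nucleotides in a pair is kept. The Python sketch below counts site-by-site pairs for two aligned sequences both ways; the function names and the `ordered` flag are hypothetical, and sites with gaps or ambiguous symbols are skipped by assumption.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

BASES = "ACGT"


def pair_classes(ordered):
    """All possible pair labels: 16 if order matters, 10 if it does not."""
    if ordered:
        return [a + b for a, b in product(BASES, repeat=2)]
    return [a + b for a, b in combinations_with_replacement(BASES, 2)]


def count_pairs(seq_a, seq_b, ordered=True):
    """Count nucleotide pairs site by site for two aligned sequences."""
    counts = Counter({cls: 0 for cls in pair_classes(ordered)})
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in BASES and b in BASES:
            # With ordered=False, 'CA' and 'AC' fall into the same class 'AC'.
            pair = a + b if ordered else "".join(sorted(a + b))
            counts[pair] += 1
    return counts


print(len(pair_classes(True)), len(pair_classes(False)))
print(count_pairs("ACGT", "CAGT", ordered=True)["AC"])
print(count_pairs("ACGT", "CAGT", ordered=False)["AC"])
```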

Codon Usage

Statistics | Codon Usage

This command is visible only if the data contain protein-coding nucleotide sequences. MEGA computes
the percent codon usage and the RSCU (relative synonymous codon usage) value for each codon for all
sequences included in the dataset. Results will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).
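
For readers unfamiliar with RSCU, the following Python sketch computes it with the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch assumes the standard genetic code and in-frame sequences starting at the first base; the function name rscu is hypothetical, and codons containing gaps or ambiguous bases are skipped.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """Relative synonymous codon usage pooled over all given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STANDARD_CODE:
                counts[codon] += 1
    # Group codons into synonymous families by the amino acid they encode.
    families = {}
    for codon, aa in STANDARD_CODE.items():
        families.setdefault(aa, []).append(codon)
    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            result[c] = counts[c] * len(codons) / total if total else 0.0
    return result


values = rscu(["GAAGAAGAGAAA"])
print(round(values["GAA"], 3), round(values["GAG"], 3), values["AAA"], values["AAG"])
```

An RSCU of 1 means a codon is used exactly as often as expected under equal usage; values above 1 indicate a preferred codon.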

Amino Acid Composition

Statistics | Amino acid Composition


This command is visible only if the data consists of amino acid sequences or if the translated protein coding
nucleotide sequences are displayed. MEGA will compute the amino acid frequencies for each sequence as
well as an overall average, which will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).

Use All Selected Sites

Statistics | Use All Selected Sites


Analysis is conducted on all sites in the sequences, irrespective of whether any sites have been labeled or
highlighted.

Use only Highlighted Sites

Statistics | Use only Highlighted Sites


Sites can be selected according to various criteria (see Highlight Sites), and analyses will be performed only
on the chosen subset of sites. All statistical attributes will be based on these sites.

Display Results in XL/CSV/Text

Output from items in the Statistics menu is written in one of three formats. If Text is selected,
all items in the Statistics menu show their output as text. The same applies to Comma-Separated
Values (CSV) and Excel (XL).
Only one of these output formats may be selected at any one time.

DISTANCE DATA EXPLORER

The Distance Data Explorer shows the pair-wise distance data. This explorer is flexible and it provides
useful functionalities for computing within group, among group, and overall averages, as well as facilities
for selecting data subsets.
This explorer consists of a number of regions as follows:
Menu Bar
File menu
Display menu
Average menu
Help: This item brings up the help file.
Tool Bar
The tool bar provides quick access to a number of menu items.
General Utilities

Export icon: This icon brings up the Options dialog box for exporting the distance matrix, with options to
control how MEGA writes the output data. The available formats are Text, MEGA, CSV, and Excel.
Taxa/Groups icon: This button brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.
Distance Display Precision
Two buttons adjust the precision of the distance display: with each click, one decreases the precision by one
decimal place and the other increases it by one decimal place.
Column Sizer: This is a slider that can be used to increase or decrease the width of the columns that show
the pairwise distances.
The 2-Dimensional Data Grid
This grid displays the pair-wise distances between all the sequences in the data in the form of a lower or
upper triangular matrix. The names of the sequences and groups are the row-headers; the column headers
are numbered from 1 to m, m being the number of sequences. There is a column sizer button for the row-
headers, so you can increase or decrease the column size to accommodate the full name of the sequences and
groups.
Fixed Row: This is the first row in the data grid that displays the column number.
Fixed Column: This is the first and the leftmost column in the data grid and contains taxa names. Even if
you scroll past the initial screen this column will always be visible. To include a taxon in the data set for
analysis, check the associated box. In this column, you also can drag-and-drop taxa names to sort them in
the desired manner.
Rest of the Grid: The cells to the right of the first column and below the first row contain the pairwise
distances. Note that all cells containing data corresponding to unselected taxa are drawn in a light color.
Status bar
The status bar shows the sequence pair corresponding to the position of the cursor when the cursor is on any
distance value in the display.

File Menu (in Distance Data Explorer)

The File menu consists of three commands:


Select & Edit Taxa/Groups: This brings up a dialog box to categorize the taxa into groups.
Export/Print Distances: This brings up a dialog box for writing pairwise distances as a text file, with a
choice of several formats.
Quit Viewer: This closes the Distance Data Explorer.

Display Menu (in Distance Data Explorer)

The Display menu consists of four main commands:


Show Only Selected Taxa: This is a toggle, showing a matrix of all or only selected taxa.
Sort Taxa: This provides a submenu for sorting the order of taxa in one of three ways: by input order, by
taxon name or by group name.
Show Group Names: This is a toggle for displaying or hiding the group name next to the name of each
taxon, when available.
Change Font: This brings up the dialog box, which allows you to choose the type and size of the font used to
display the distance values.

Average Menu (in Distance Data Explorer)

This menu is used for the computation of average values using the selected taxa. The following averaging
options are available:
Overall: This computes and displays the overall average.
Within groups: This is enabled only if at least one group is defined. For each group, an arithmetic average
is computed for all valid pairwise comparisons and results are displayed in the Distance Matrix
Explorer. All incalculable within-group averages are shown with a red “n/c”.
Between groups: This is enabled only if at least two groups of taxa are defined. For each pair of groups,
an arithmetic average is computed for all valid inter-group pairwise comparisons and results are
displayed in the Distance Matrix Explorer. All incalculable between-group averages are shown with a red
“n/c”.
Net Between Groups: This computes net average distances between groups of taxa and is enabled only if at
least two groups of taxa with at least two taxa each are defined. The net average distance between two
groups is given by
dA = dXY – (dX + dY)/2
where dXY is the average distance between groups X and Y, and dX and dY are the mean within-group
distances. All incalculable net between-group averages are shown with a red “n/c”.
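
The within-group, between-group and net between-group averages can be sketched in Python as shown below, following the formula dA = dXY – (dX + dY)/2. The function names and the dictionary layout (distances keyed by an unordered pair of taxon names) are assumptions made for the illustration; None stands in for the “n/c” shown when an average cannot be computed.

```python
from itertools import combinations, product


def mean(values):
    """Arithmetic mean, or None when there is nothing to average."""
    values = list(values)
    return sum(values) / len(values) if values else None


def within_group(dist, members):
    """Mean distance over all pairs of taxa inside one group."""
    return mean(dist[frozenset(p)] for p in combinations(members, 2))


def between_groups(dist, group_x, group_y):
    """Mean distance over all pairs with one taxon from each group."""
    return mean(dist[frozenset(p)] for p in product(group_x, group_y))


def net_between_groups(dist, group_x, group_y):
    """dA = dXY - (dX + dY) / 2, or None if any component is incalculable."""
    d_xy = between_groups(dist, group_x, group_y)
    d_x = within_group(dist, group_x)
    d_y = within_group(dist, group_y)
    if None in (d_xy, d_x, d_y):
        return None
    return d_xy - (d_x + d_y) / 2


# Hypothetical distances among four taxa in two groups.
dist = {
    frozenset(("a", "b")): 0.1, frozenset(("c", "d")): 0.3,
    frozenset(("a", "c")): 0.5, frozenset(("a", "d")): 0.6,
    frozenset(("b", "c")): 0.5, frozenset(("b", "d")): 0.6,
}
print(net_between_groups(dist, ["a", "b"], ["c", "d"]))
```

A group with a single taxon has no within-group pairs, so its within-group mean, and therefore the net distance, is reported as incalculable. This matches the requirement that each group contain at least two taxa.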

Options dialog box

At the top of the options dialog box is an option for the output format (Publication and MEGA) with the type
of information that is output (distances) mentioned beneath. Below this is the option for outputting the
distance data as a lower left triangular matrix or an upper right triangular matrix. On the right are options for
specifying the number of decimal places for the pairwise distances in the output, and the maximum number
of distances per line in the matrix.
When exporting to Excel or CSV you can choose to export as either a normal matrix or in a column format
(Species 1, Species 2, Distance, Std Err.). The standard matrix is limited to 255 columns (that is,
255 taxa) because of the maximum number of columns that Excel allows.
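
To show how the column format relates to the matrix format, the Python sketch below converts a lower-left triangular matrix into one row per pair and writes it as CSV. The function names and the input layout are assumptions for the illustration, and the standard error column is omitted for brevity.

```python
import csv
import io


def matrix_to_columns(names, lower_triangle):
    """Convert a lower-left triangular matrix into (name 1, name 2, distance) rows.

    Assumption: lower_triangle[i] holds the distances from taxon i + 1
    to taxa 0 .. i, in that order.
    """
    rows = []
    for i, row in enumerate(lower_triangle):
        for j, distance in enumerate(row):
            rows.append((names[j], names[i + 1], distance))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Species 1", "Species 2", "Distance"])
    writer.writerows(rows)
    return buffer.getvalue()


names = ["A", "B", "C"]
lower = [[0.1], [0.2, 0.3]]
print(rows_to_csv(matrix_to_columns(names, lower)))
```

Because each pair occupies a row rather than a column, the column format grows downward and is not affected by a limit on the number of columns.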
In addition there are three buttons, one to print or save the output, one to quit the Options dialog box without
exporting the data (Cancel), and the third to bring up the help file (this file). The Print/Save button brings
up the Distances Display Box, where the distances are displayed as specified, with various options to edit,
print and save the output.

Text Editor

MEGA includes a Text File Editor, which is useful for creating and editing ASCII text files. It is invoked
automatically by MEGA if the input data file processing modules detect errors in the data file format. In
this case, you should make appropriate changes and save the data file.
The text editor is straightforward if you are familiar with programs like Notepad. Click on the section you
wish to change, type in the new text, or select text to cut, copy or paste. A document is shown in a single
display font. You can have as many text editor windows open at one time as you wish, and you may close
them independently. However, if you have a file open in the Text Editor, you should save it and close
the Text Editor window before trying to use that data file for analysis in MEGA. Otherwise, MEGA may
not have the most up-to-date version of the data.
The Text File Editor and Format converter is a sophisticated tool with numerous special capabilities that
include:
• Large files – The ability to operate on files of virtually unlimited size and line lengths.
• General purpose – Used to view/edit any ASCII text file.
• Undo/Redo – The availability of an unlimited depth of undo/redo options.
• Search/Replace – Searches for and does block replacements for arbitrary strings.
• Clipboard – Supports familiar clipboard cut, copy, and paste operations.
• Normal and Column blocks – Supports regular contiguous line blocks and columnar blocks. This
is quite useful while manually aligning sequences in the Text Editor.
• Drag/Drop – Moves text with the familiar cut and paste operations, or lets you select the text and
then move it with the mouse.
• Printing – Prints the contents of the edit file.

The Text Editor contains a menu bar, a toolbar, and a status bar.
The Menu bar
Menu Description
File menu The File Menu contains the functions that are most commonly used to open, save,
rename, print, and close files. (Although there is no separate “rename” function
available, you can rename a file by choosing the Save As… menu item and giving the
file a different name before you save it.)
Edit menu The Edit Menu contains functions that are commonly used to manipulate blocks of text.
Many of the edit menu items interact with the Windows Clipboard, which is a hidden
window that allows various selections to be copied and pasted across documents and
applications.
Search menu The Search Menu has several functions that allow you to perform searches and
replacements of text strings. You can also jump directly to a specific line number in the
file.
Display menu The Display Menu contains functions that affect the visual display of files in the edit
windows.
Utilities menu The Utilities Menu contains several functions that make this editor especially useful for
working with files containing molecular sequence data. (Note that the MEGA editor does
not try to understand the contained data; it simply operates on the text, assuming that the
user knows what (s)he is doing.)
Toolbar
The Toolbar contains shortcuts to some frequently used menu commands.
Status Bar
The Status bar is positioned at the bottom of the editor window. It shows the position of the cursor (line
number and position in the line), whether the file has been edited, and the status of some keyboard keys
(CAPS, NUM, and SCROLL lock).

Hotkeys and Shortcut keys


Many menu items have a hotkey and/or a shortcut key. These are special key combinations that are
helpful for people who are more comfortable using a keyboard than the mouse. Hotkeys are identified by
an underlined character in the name of the menu item, e.g., “File”, “New”. These allow you to hold down
the Alt-key, which is usually found next to the space bar on the keyboard, then hit the underlined letter to
produce the same action as if you clicked that name with the mouse. We show this using the notation
<Alt>+key – e.g., the hotkey for the file menu item is shown as <Alt>+F. Be sure that you depress both
keys together, holding the <Alt> key down a little bit longer than the letter key. (Some people try hitting
both keys simultaneously, as if they’re hitting two keys on a piano keyboard. Quite often, this approach
does not produce the desired results.)
For instance, you could create a new file by clicking the mouse on the “File” menu item, then
clicking on the “New” item beneath it. Using hotkeys, you could type <Alt>+F followed by <Alt>+N. Or,
more simply, while you’re holding down the <Alt> key, hit the ‘F’ key followed by the ‘N’ key, then
release the <Alt> key.
You might notice that several menu items, e.g., the New Item on the File menu, show something to
the right that looks like ‘Ctrl+N’. This is called a Shortcut key sequence. Whereas executing a command
with hotkeys often requires several keystrokes, shortcut keys can do the same thing with just one
keystroke. Shortcut keys work the same as hotkeys, using the <Ctrl> key instead of the <Alt> key. To
create a new file, for example, you can hold down the <Ctrl> key and hit the ‘N’ key, which is shown as
<Ctrl>+N here. (In the menus, this appears simply as ‘Ctrl+N’.)
Not all menu items have associated shortcut keys because there are only 26 shortcut keys, one for
each letter of the alphabet. Hotkeys, in contrast, are localized to each menu and submenu. For hotkeys to
work, the menu item must be visible whereas shortcut keys work at any time. For instance, if you are
typing data into a text file and want to create a note in a new window, you may simply hit the shortcut key
sequence, <Ctrl>+N, to generate a new window. After you type the note, you can hit <Ctrl>+S to save it,
give it a file name, and hit the Enter key; then you can hit the <Alt>+F+C
hotkey sequence to close the file (there is no shortcut key for closing a file).

USING TEXT FILE EDITOR


FILE MENU

New (in Text Editor)


File | New
Use this command to create a new file in the Text Editor.

Open (in Text Editor)


File | Open
Use this command to open an existing file in the Text Editor.

Reopen (in Text Editor)


File | Reopen
Choose this command to reopen a recently closed text file from the most-recently-used-files list. When
you close a text file in the Text Editor, it is added to the Reopen list.

Select All (in Text Editor)


Edit | Select All
This is used to select (highlight) everything in the displayed file.

Go to Line (in Text Editor)


Edit | Go to Line #
This opens a small dialog box that allows you to enter a number indicating the line to which you want to
move.

Show Line Numbers (in Text Editor)


Display | Show Line Numbers
This item can be checked (on) or un-checked (off) to control whether line numbers are displayed next to the
lines.

Word Wrap (in Text Editor)


Display | Word Wrap
This item can be checked (on) or un-checked (off) to control whether lines in the edit window are
automatically wrapped around based on the current window’s width.

Save (in Text Editor)


File | Save
This allows you to save the file currently being edited.

Save As (in Text Editor)


File | Save As
This command brings up the Save As dialog box, which allows you to choose the directory, the filename
and extension, and the type of file you wish to save. To make a file suitable for loading as data in MEGA,
you should save the file in MEGA format (it is a plain ASCII text file). If there is already another file
with the same name, it will be overwritten.

Print (in Text Editor)


File | Print
This command will print the currently displayed file to the selected printer.
Close File (in Text Editor)
File | Close File
This closes the current file.

Exit Editor (in Text Editor)


File | Exit Editor
This closes the currently open file. If the file was modified, but the modifications have not been
saved, MEGA will ask whether to discard the changes. Note that this command exits the Text Editor only,
not MEGA.

Delete (in Text Editor)


Edit | Delete
This deletes the selected (highlighted) text. It is NOT copied to the clipboard.

EDIT MENU
Cut (in Text Editor)
Edit | Cut
This command places a copy of the selected text on the Windows clipboard, removing the original string.
To paste the contents on the clipboard, use the Paste command.

Copy (in Text Editor)


Edit | Copy
This places a copy of the selected text on the Windows clipboard, leaving the original string untouched.
To paste the contents on the clipboard, use the Paste command.

Paste (in Text Editor)


Edit | Paste
This inserts the most recently copied text present on the Windows clipboard.

Undo (in Text Editor)


Edit | Undo
Choose this command to undo your most recent action. Repeated use of this command will undo each
action, starting with the most recent and going to the oldest. It has unlimited depth.

Font (in Text Editor)


Display | Set Font
Choose this command to activate a dialog box with which you can change the display font used by
the Text Editor. An ASCII text file has no font attribute; it contains only the text itself, so changing
the font affects only the display. MEGA remembers the new font as your preferred display font for the
Text Editor.

SEARCH MENU
Find (in Text Editor)
Search | Find
Choose this command to display the Find Text dialog box.

Find Again (in Text Editor)
Search | Find Again
Choose this to repeat the last Find command.

Replace (in Text Editor)


Search | Replace
This brings up a Search and Replace dialog box, which allows you to replace a text string in the file
currently being edited.

VISUAL TOOLS FOR DATA MANAGEMENT

Setup/Select Genes & Domains (Sequence Data Explorer)


Data | Setup/Select Genes & Domains
Setup/Select Genes & Domains can be invoked from the Data menu in the Sequence Data Explorer and is also
available directly in the Data menu of the main interface.

Setup/Select Taxa & Groups (in Sequence Data Explorer)


Data | Setup/Select Taxa & Groups
Setup/Select Taxa & Groups can be invoked from the Data menu in the Sequence Data Explorer and is
also available directly in the Data menu of the main interface.

DATA SUBSET SELECTION

Sequence Data Subset Selection


Any subset of sequence data can be selected for analysis using the options in the Data menu. You may:
1. Select Taxa (sequences) or Groups of taxa through the Setup/Select Taxa & Groups dialog box.
2. Choose Domains and Genes through the Setup/Select Genes & Domains dialog box.
Items 1 and 2 lead to the construction of a primary data subset, which is maintained until it is modified in the two
dialog boxes mentioned in the above items or in the Sequence Data Explorer.
3. Select any combination of Codon Positions to use through the Analysis Preferences/Options dialog box from the Data
| Select Preferences menu item in the main interface.
4. Choose to include only the Labeled Sites through the Data | Select Preferences menu item.
5. Decide to enforce Complete-Deletion or Pairwise-Deletion of the missing data and alignment gaps.
Items 3, 4, and 5 provide the second level of data subset options. You are given relevant choices immediately prior to
the start of the analysis. Therefore, these choices are secondary in nature and are specific to the currently requested
analysis. The Analysis Preferences dialog box remembers them for your convenience and provides them as a default
the next time you conduct an analysis that utilizes those options.
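
The difference between the Complete-Deletion and Pairwise-Deletion options can be illustrated with a minimal Python sketch. It is not MEGA code: the function names are hypothetical, and the symbols "-" (gap) and "?" (missing data) are assumed here for illustration.

```python
# Illustrative only: assumed symbols for alignment gaps and missing data.
GAP_OR_MISSING = set("-?")


def complete_deletion_sites(seqs):
    """Columns in which no sequence has a gap or a missing-data symbol."""
    length = len(seqs[0])
    return [i for i in range(length)
            if all(s[i] not in GAP_OR_MISSING for s in seqs)]


def pairwise_deletion_sites(seq_x, seq_y):
    """Columns that are valid for one specific pair of sequences."""
    return [i for i, (a, b) in enumerate(zip(seq_x, seq_y))
            if a not in GAP_OR_MISSING and b not in GAP_OR_MISSING]


alignment = ["ACGTAC", "A-GTAC", "ACG?AC"]
print(complete_deletion_sites(alignment))                    # [0, 2, 4, 5]
print(pairwise_deletion_sites(alignment[0], alignment[1]))   # [0, 2, 3, 4, 5]
```

Under complete deletion every pair is compared over the same columns, whereas under pairwise deletion the number of compared sites can differ from pair to pair.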

Distance Data Subset Selection


You may select Taxa (sequences) or Groups of taxa through the Setup/Select Taxa & Groups dialog box to
construct a distance matrix. You can also select sequences in the Distance Data Explorer by clicking on the check
marks next to the taxa names.

PART-IV: EVOLUTIONARY ANALYSIS

COMPUTING BASIC STATISTICAL QUANTITIES FOR SEQUENCE DATA

Basic Sequence Statistics

In the study of molecular evolution, it often is necessary to know some basic statistical quantities, such as
nucleotide frequencies, codon frequencies, and transition/transversion ratios. The statistical quantities that
can be computed by MEGA are discussed in this section.

Nucleotide and Amino Acid Compositions

The relative frequencies of the four nucleotides (nucleotide composition) or of the 20 amino acid residues
(amino acid composition) can be computed for one specific sequence or for all sequences. For the coding
regions of DNA, additional columns are presented for the nucleotide compositions at the first, second, and
third codon positions. All results are presented domain-by-domain, if the dataset contains multiple domains.
Results for the amino acid composition are presented in a similar tabular form.

Nucleotide Pair Frequencies

Statistics | Nucleotide Pair Frequencies


This command is visible only if the data consists of nucleotide sequences. There are two options available:
one in which the nucleotide pairs are counted bidirectionally site-by-site for the two sequences (giving
rise to 16 different nucleotide pairs), and another in which the pairs are counted unidirectionally (10 nucleotide
pairs). MEGA will compute the frequencies of these quantities for each sequence as well as an overall
average. They will be displayed by domain (if domains have been defined in Setup/Select Genes
& Domains).
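
As a rough illustration of the two counting schemes, the Python sketch below tallies site-by-site nucleotide pairs for two aligned sequences. It assumes that the 16-pair option treats the pair (A, G) as distinct from (G, A), and that the 10-pair option pools them; the function name and the `directional` flag are hypothetical and are not part of MEGA.

```python
from collections import Counter

BASES = "ACGT"


def pair_counts(seq_x, seq_y, directional=True):
    """Count nucleotide pairs site by site for two aligned sequences.

    directional=True  keeps (A, G) and (G, A) apart (16 possible pairs).
    directional=False pools them into one unordered pair (10 possible pairs).
    """
    counts = Counter()
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps, missing data, and ambiguous symbols
        key = (a, b) if directional else tuple(sorted((a, b)))
        counts[key] += 1
    return counts


x, y = "ACGTAC", "AGCTTC"
ordered = pair_counts(x, y, directional=True)
pooled = pair_counts(x, y, directional=False)
print(ordered[("C", "G")], ordered[("G", "C")])  # 1 1
print(pooled[("C", "G")])                         # 2
```

Dividing each count by the number of valid sites gives the corresponding pair frequency.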

Codon Usage

Statistics | Codon Usage


This command is visible only if the data contains protein-coding nucleotide sequences. MEGA computes
the percent codon usage and the RSCU (relative synonymous codon usage) values for each codon for all
sequences included in the dataset. Results will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).
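
The RSCU statistic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MEGA's implementation: it assumes the standard genetic code, uses the usual definition of RSCU (observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally), and treats the stop codons as one family. The function name `rscu` is hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def rscu(coding_seq):
    """Return {codon: RSCU} for every codon family observed in the sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = defaultdict(list)              # amino acid -> synonymous codons
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                          # amino acid not observed
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


result = rscu("GCTGCTGCCGCA")  # four alanine codons: GCT x2, GCC, GCA
print(result["GCT"], result["GCC"], result["GCG"])  # 2.0 1.0 0.0
```

An RSCU of 1 means the codon is used as often as expected under equal usage; values above 1 indicate a preferred codon.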

Pattern tests

The substitution pattern homogeneity between sequences (Kumar and Gadagkar 2001)
Compute Pattern Disparity Index (disparity index) and Compute Composition Distances (pairwise
sequence composition distance) are two test statistics related to the substitution pattern homogeneity
test (Kumar and Gadagkar 2001).

COMPUTING EVOLUTIONARY DISTANCES

Distance Models

Models for estimating distances

The evolutionary distance between a pair of sequences usually is measured by the number of nucleotide (or amino
acid) substitutions occurring between them. Evolutionary distances are fundamental for the study of molecular
evolution and are useful for phylogenetic reconstructions and the estimation of divergence times. Most of the widely
used methods for distance estimation for nucleotide and amino acid sequences are included in MEGA. In the following
three sections, we present a brief discussion of these methods: nucleotide substitutions, synonymous-nonsynonymous
substitutions, and amino acid substitutions. Further details of these methods and general guidelines for the use of these
methods are given in Nei and Kumar (2000). Note that in addition to the distance estimates, MEGA also computes
the standard errors of the estimates using the analytical formulas and the bootstrap method.
Distance methods included in MEGA are divided into three categories (Nucleotide, Syn-nonsynonymous,
and Amino acid):
Nucleotide
Sequences are compared nucleotide-by-nucleotide. These distances can be computed for protein coding and
non-coding nucleotide sequences.
  No. of differences
  p-distance
  Jukes-Cantor Model
    with Rate Uniformity Among Sites
    with Rate Variation Among Sites
  Tajima-Nei Model
    with Rate Uniformity and Pattern Homogeneity
    with Rate Variation Among Sites
    with Pattern Heterogeneity Between Lineages
    with Rate Variation and Pattern Heterogeneity
  Kimura 2-Parameter Model
    with Same Rate Among Sites
    with Rate Variation Among Sites
  Tamura 3-Parameter Model
    with Rate Uniformity and Pattern Homogeneity
    with Rate Variation Among Sites
    with Pattern Heterogeneity Between Lineages
    with Rate Variation and Pattern Heterogeneity
  Tamura-Nei Model
    with Rate Uniformity and Pattern Homogeneity
    with Rate Variation Among Sites
    with Pattern Heterogeneity Between Lineages
    with Rate Variation and Pattern Heterogeneity
  Log-Det Method
    with Pattern Heterogeneity Between Lineages
  Maximum Composite Likelihood Model
    with Rate Uniformity and Pattern Homogeneity
    with Rate Variation Among Sites
    with Pattern Heterogeneity Between Lineages
    with Rate Variation and Pattern Heterogeneity

NUCLEOTIDE SUBSTITUTION MODELS

No. of differences (Nucleotide)

This distance is the number of sites at which the two compared sequences differ. If you are using the pairwise
deletion option for handling gaps and missing data, it is important to realize that this count is not normalized by
the number of valid sites compared, so counts are not comparable across pairs if the sequences contain alignment
gaps. Therefore, we recommend using the Complete-Deletion option with this distance.

For this distance, MEGA provides facilities for computing the following quantities:
d: Transitions + Transversions: Number of different nucleotide sites.
s: Transitions only: Number of nucleotide sites with transitional differences.
v: Transversions only: Number of nucleotide sites with transversional differences.
R = s/v: Transition/transversion ratio.
L: No of valid common sites: Number of compared sites.
Formulas for computing these quantities and their variances are as follows.
Var(d) = nd(L - nd)/L
Var(s) = s(L - s)/L
Var(v) = v(L - v)/L
R = s/v
Var(R) = [c1^2 P + c2^2 Q - (c1 P + c2 Q)^2]/L
where c1 = 1/Q and c2 = -P/Q^2, nd is the number of sites that differ, and
P and Q are the proportions of sites showing transitional and transversional differences, respectively.

See also Nei and Kumar (2000), page 33.
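
The counts described above can be obtained with a short Python sketch. It is only an illustration: the function name is hypothetical, and sites where either sequence carries anything other than A, C, G, or T are skipped, which corresponds to pairwise deletion.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")
BASES = PURINES | PYRIMIDINES


def count_differences(seq_x, seq_y):
    """Return (L, s, v): valid common sites, transitional and
    transversional differences between two aligned sequences."""
    valid = s = v = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        if a not in BASES or b not in BASES:
            continue                      # gap, missing, or ambiguous site
        valid += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            s += 1                        # A<->G or C<->T
        else:
            v += 1                        # purine <-> pyrimidine
    return valid, s, v


L, s, v = count_differences("ACGTACGT-A", "GCGTTCAT?A")
nd = s + v
print(L, s, v, nd)         # 9 2 1 3
print(nd * (L - nd) / L)   # Var(d) = 2.0
```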

p-distance (Nucleotide)

This distance is the proportion (p) of nucleotide sites at which two sequences being compared are different. It is
obtained by dividing the number of nucleotide differences by the total number of nucleotides compared. It does not
make any correction for multiple substitutions at the same site, substitution rate biases (for example, differences in
the transitional and transversional rates), or differences in evolutionary rates among sites.

MEGA provides facilities for computing the following p-distances and related quantities:

d: Transitions + Transversions: Proportion of nucleotide sites that are different.
s: Transitions only: Proportion of nucleotide sites with transitional differences.
v: Transversions only: Proportion of nucleotide sites with transversional differences.
R = s/v: Transition/transversion ratio.
L: No of valid common sites: Number of sites compared.

Formulas for computing these quantities are as follows:

Quantity   Formula   Variance
p          nd/L      p(1 - p)/L
s          P         s(1 - s)/L
v          Q         v(1 - v)/L
R          P/Q       [c1^2 P + c2^2 Q - (c1 P + c2 Q)^2]/L

where c1 = 1/Q and c2 = -P/Q^2, nd is the number of sites that differ, and
P and Q are the proportions of sites showing transitional and transversional differences, respectively.

See also Nei and Kumar (2000), page 33.
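
A minimal Python sketch of these quantities follows. The function name and the returned dictionary keys are hypothetical. The coefficients c1 and c2 are taken here as the partial derivatives of R = P/Q with respect to P and Q (the delta method), which is an assumption of this sketch rather than something verified against MEGA's source code.

```python
def p_distance_stats(L, s, v):
    """p-distance quantities from counts: L valid sites, s transitional
    and v transversional differences."""
    P, Q = s / L, v / L
    p = P + Q
    stats = {
        "p": p, "var_p": p * (1 - p) / L,
        "P": P, "var_P": P * (1 - P) / L,
        "Q": Q, "var_Q": Q * (1 - Q) / L,
    }
    if v > 0:  # R is undefined when there are no transversions
        c1 = 1 / Q             # dR/dP
        c2 = -P / Q ** 2       # dR/dQ
        stats["R"] = P / Q
        stats["var_R"] = (c1 ** 2 * P + c2 ** 2 * Q
                          - (c1 * P + c2 * Q) ** 2) / L
    return stats


stats = p_distance_stats(L=9, s=2, v=1)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With so few sites the variance of R is large, which is one reason R is usually interpreted only for long alignments.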

Jukes-Cantor distance

In the Jukes and Cantor (1969) model, the rate of nucleotide substitution is the same for all pairs of the four
nucleotides A, T, C, and G. As is shown below, the multiple hit correction equation for this model produces a
maximum likelihood estimate of the number of nucleotide substitutions between two sequences. It assumes equal
substitution rates among sites (see the related gamma distance) and equal nucleotide frequencies, and it does
not correct for the higher rate of transitional substitutions compared with transversional substitutions.

The Jukes-Cantor model

MEGA provides facilities for computing the following quantities:
d: Transitions + Transversions : Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
Formulas for computing these quantities are as follows:
Distance

where p is the proportion of sites with different nucleotides.

Variance

See also Nei and Kumar (2000), page 36.
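
The formulas themselves are not reproduced in this text. As a hedged illustration, the Python sketch below assumes the standard published form of the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), with variance p(1 - p) / [(1 - 4p/3)^2 L]. The function name is hypothetical.

```python
import math


def jukes_cantor(p, L):
    """Jukes-Cantor distance and its variance.

    p: proportion of sites with different nucleotides
    L: number of sites compared
    """
    w = 1.0 - 4.0 * p / 3.0
    if w <= 0:
        raise ValueError("the correction is undefined for p >= 0.75")
    d = -0.75 * math.log(w)
    var = p * (1.0 - p) / (w * w * L)
    return d, var


d, var = jukes_cantor(0.3, 100)
print(f"d = {d:.4f}, SE = {math.sqrt(var):.4f}")
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.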

Tajima-Nei distance

In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this distance assumes equal substitution rates among sites and between transitional and transversional substitutions.

The Felsenstein-Tajima-Nei model

MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
Formulas for computing these quantities are as follows:
Distance

where p is the proportion of sites with different nucleotides and

where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance

See also Nei and Kumar (2000), page 38.
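
Because the formulas are not reproduced in this text, the following Python sketch assumes the standard published form of the Tajima-Nei distance: d = -b ln(1 - p/b), where b = (1/2)[1 - sum of gi^2 + p^2/c] and c is the sum over the six unordered nucleotide pairs of xij^2 / (2 gi gj), with variance b^2 p(1 - p) / [(b - p)^2 L]. The function name is hypothetical, and the nucleotide frequencies are assumed to be averaged over the two sequences.

```python
import math
from collections import Counter
from itertools import combinations

BASES = "ACGT"


def tajima_nei(seq_x, seq_y):
    """Tajima-Nei distance and variance for two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq_x.upper(), seq_y.upper())
             if a in BASES and b in BASES]
    L = len(pairs)
    # g[i]: frequency of nucleotide i averaged over both sequences
    base_counts = Counter(a for a, _ in pairs) + Counter(b for _, b in pairs)
    g = {n: base_counts[n] / (2 * L) for n in BASES}
    # mismatch[{i, j}]: number of sites showing the unordered pair i, j
    mismatch = Counter(frozenset(ab) for ab in pairs if ab[0] != ab[1])
    p = sum(mismatch.values()) / L
    c = sum((mismatch[frozenset((i, j))] / L) ** 2 / (2 * g[i] * g[j])
            for i, j in combinations(BASES, 2)
            if g[i] > 0 and g[j] > 0)
    b = 0.5 * (1 - sum(f * f for f in g.values())
               + (p * p / c if c > 0 else 0.0))
    d = -b * math.log(1 - p / b)
    var = b * b * p * (1 - p) / ((b - p) ** 2 * L)
    return d, var


d, var = tajima_nei("ACGTACGTACGTACGT", "ACGTACGTACGTACGA")
print(f"d = {d:.4f}, var = {var:.6f}")
```

When all four nucleotide frequencies are 0.25 and every type of mismatch is equally frequent, b equals 3/4 and the expression reduces to the Jukes-Cantor distance.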

Kimura 2-parameter distance

Kimura’s two parameter model (1980) corrects for multiple hits, taking into account transitional and transversional
substitution rates, while assuming that the four nucleotide frequencies are the same and that rates of substitution do not
vary among sites (see related Gamma distance).
The Kimura 2-parameter model

MEGA provides facilities for computing the following quantities:


Quantity                         Description
d: Transitions + Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
Formulas for computing these quantities are as follows:
Distances

where P and Q are the frequencies of sites with transitional and transversional differences respectively, and

Variances

where

See also Nei and Kumar (2000), page 37.
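
As the formulas are not reproduced in this text, the sketch below assumes the standard published form of the Kimura 2-parameter estimators, with w1 = 1 - 2P - Q and w2 = 1 - 2Q: d = -(1/2) ln w1 - (1/4) ln w2, s = -(1/2) ln w1 + (1/4) ln w2, and v = -(1/2) ln w2. The variance of d uses c1 = 1/w1, c2 = 1/w2, and c3 = (c1 + c2)/2. The function name is hypothetical.

```python
import math


def kimura_2p(P, Q, L):
    """Kimura 2-parameter quantities.

    P, Q: proportions of sites with transitional / transversional differences
    L:    number of sites compared
    Returns (d, s, v, var_d).
    """
    w1 = 1.0 - 2.0 * P - Q
    w2 = 1.0 - 2.0 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance is undefined for these P and Q")
    d = -0.5 * math.log(w1) - 0.25 * math.log(w2)
    s = -0.5 * math.log(w1) + 0.25 * math.log(w2)   # transitional part
    v = -0.5 * math.log(w2)                         # transversional part
    c1, c2 = 1.0 / w1, 1.0 / w2
    c3 = 0.5 * (c1 + c2)
    var_d = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / L
    return d, s, v, var_d


d, s, v, var_d = kimura_2p(P=0.10, Q=0.05, L=500)
print(f"d = {d:.4f}, s = {s:.4f}, v = {v:.4f}, R = {s / v:.2f}")
```

When P = Q/2, which is the ratio expected if all substitution types are equally likely, d equals the Jukes-Cantor distance for p = P + Q.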

Tamura 3-parameter distance

Tamura’s 3-parameter model (1992) corrects for multiple hits, taking into account differences in transitional
and transversional rates and G+C-content bias. It assumes an equality of substitution rates among sites.

The Tamura 3-parameter model

MEGA provides facilities for computing the following quantities:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

The formulas for computing these quantities are as follows:


Distances

where P and Q are the proportion of sites with transitional and transversional differences respectively, and

Variances

where

See also Nei and Kumar (2000), page 39.
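
Since the formulas are not reproduced in this text, the sketch below assumes the standard published form of the Tamura 3-parameter distance, d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q), where h = 2θ(1 - θ) and θ is the G+C content. The variance is obtained here by the delta method, with c1 and c3 the partial derivatives of d with respect to P and Q. The function name and the choice to pass θ as an argument are assumptions of this sketch.

```python
import math


def tamura_3p(P, Q, theta, L):
    """Tamura 3-parameter distance and its variance.

    P, Q:  proportions of sites with transitional / transversional differences
    theta: G+C content (a fraction between 0 and 1)
    L:     number of sites compared
    """
    h = 2.0 * theta * (1.0 - theta)
    w1 = 1.0 - P / h - Q
    w2 = 1.0 - 2.0 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance is undefined for these inputs")
    d = -h * math.log(w1) - 0.5 * (1.0 - h) * math.log(w2)
    c1, c2 = 1.0 / w1, 1.0 / w2
    c3 = h * (c1 - c2) + c2            # derivative of d with respect to Q
    var = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / L
    return d, var


d, var = tamura_3p(P=0.10, Q=0.05, theta=0.5, L=500)
print(f"d = {d:.4f}, var = {var:.6f}")
```

With θ = 0.5 (no G+C bias) h equals 1/2 and the distance coincides with the Kimura 2-parameter distance, which is a convenient sanity check.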

Tamura-Nei distance

The Tamura-Nei model (1993) corrects for multiple hits, taking into account the differences in substitution rate
between nucleotides and the inequality of nucleotide frequencies. It distinguishes between transitional substitution
rates between purines and transitional substitution rates between pyrimidines. It also assumes equality of
substitution rates among sites (see related gamma model).

The Tamura-Nei model

MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

Formulas for computing these quantities are as follows:


Distances

where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gA, gC, gG, gT, are the respective frequencies of A, C, G
and T, gR = gA + gG, gY = gT + gC, and
Variances

where

See also Nei and Kumar (2000), page 40.
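
The formulas are not reproduced in this text. The Python sketch below assumes the standard published form of the Tamura-Nei distance, written with the symbols defined above (P1, P2, Q, gA, gC, gG, gT, gR, gY). The function name and the helper names k1, k2, and k3 are hypothetical, and the variance is omitted.

```python
import math


def tamura_nei(P1, P2, Q, g):
    """Tamura-Nei distance.

    P1: proportion of A<->G transitional differences
    P2: proportion of T<->C transitional differences
    Q:  proportion of transversional differences
    g:  dict of nucleotide frequencies with keys "A", "C", "G", "T"
    """
    gA, gC, gG, gT = g["A"], g["C"], g["G"], g["T"]
    gR, gY = gA + gG, gT + gC
    k1 = 2.0 * gA * gG / gR
    k2 = 2.0 * gT * gC / gY
    k3 = 2.0 * (gR * gY - gA * gG * gY / gR - gT * gC * gR / gY)
    w1 = 1.0 - P1 / k1 - Q / (2.0 * gR)
    w2 = 1.0 - P2 / k2 - Q / (2.0 * gY)
    w3 = 1.0 - Q / (2.0 * gR * gY)
    if min(w1, w2, w3) <= 0:
        raise ValueError("distance is undefined for these inputs")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30}
print(f"{tamura_nei(0.05, 0.05, 0.05, equal):.4f}")
print(f"{tamura_nei(0.05, 0.05, 0.05, skewed):.4f}")
```

With equal nucleotide frequencies and P1 = P2, the expression reduces to the Kimura 2-parameter distance with P = P1 + P2.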

Maximum Composite Likelihood Method

A composite likelihood is defined as a sum of related log-likelihoods. Since all pairwise distances in a distance matrix
have correlations due to the phylogenetic relationships among the sequences, the sum of their log-likelihoods is
a composite likelihood. Tamura et al. (2004) showed that pairwise distances and the related substitution parameters
are accurately estimated by maximizing the composite likelihood. They also found that, unlike the cases of ordinary
independent estimation of each pairwise distance, a complicated model had virtually no disadvantage in the composite
likelihood method for phylogenetic analyses. Therefore, only the Tamura-Nei (1993) model is available for this
method in MEGA4 (see related Tamura-Nei distance). It assumes equality of substitution pattern among lineages and
of substitution rates among sites (see related gamma model and heterogeneous patterns).

GAMMA DISTANCES

Computing the Gamma Parameter (a)

In the computation of gamma distances, it is necessary to know the gamma parameter (a). This parameter may be
estimated from the dataset under consideration or you may use the value obtained from previous studies. For
estimating a, a substantial number of sequences is necessary; if the number of sequences used is small, the estimate
has a downward bias (Zhang and Gu 1998). The current release of MEGA does not contain any programs for
estimating a; however we plan to make them available in the future. Therefore you need to use another program for
estimating the a value. Some of the frequently used programs that include this facility are PAUP* (Swofford 1998)
for DNA sequences, PAML and PAMP programs for DNA and protein sequences (Yang 1999), and GAMMA
programs from Gu and Zhang (1997).

Equal Input Model (Gamma)

In real data, amino acid frequencies usually vary among the different kinds of amino acids and substitution rates are
not uniform among sites. In this case, the correction based on the equal input model gives a better estimate of the
number of amino acid substitutions than the Poisson correction distance. The rate variation among sites is modeled
using the Gamma distribution; for computing this distance you will need to provide a gamma parameter (a).

MEGA provides facilities for computing the following quantities:


Quantity                      Description
d: distance                   Number of amino acid substitutions per site.
L: No of valid common sites   Number of sites compared.

Formulas used are:


Distance

where p is the proportion of different amino acid sites, a is the gamma parameter, gi is the frequency of amino
acid i, and

Variance

Jukes-Cantor Gamma distance

In the Jukes and Cantor (1969) model, the rate of nucleotide substitution is the same for all pairs of the four
nucleotides A, T, C, and G. The multiple hit correction equation for this model, which is given below, produces a
maximum likelihood estimate of the number of nucleotide substitutions between two sequences, while relaxing the
assumption that all sites are evolving at the same rate. However, it assumes equal nucleotide frequencies and does not
correct for higher rate of transitional substitutions as compared to transversional substitutions. If the rate variation
among sites is modeled using the Gamma distribution, you will need to provide a gamma parameter (a) for computing
this distance.
The Jukes-Cantor model

MEGA provides facilities for computing the following quantities:

d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.

The formulas for computing these quantities are as follows:


Distance

where p is the proportion of sites with different nucleotides and a is the gamma parameter.
Variance

See also Nei and Kumar (2000), page 36 and estimating gamma parameter.
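
The formulas are not reproduced in this text. As an illustration, the sketch below assumes the standard published form of the Jukes-Cantor gamma distance, d = (3/4) a [(1 - 4p/3)^(-1/a) - 1], with variance p(1 - p)/L multiplied by (1 - 4p/3)^(-2(1/a + 1)). The function name is hypothetical.

```python
import math


def jukes_cantor_gamma(p, a, L):
    """Jukes-Cantor distance with gamma-distributed rates among sites.

    p: proportion of sites with different nucleotides
    a: gamma parameter (shape)
    L: number of sites compared
    """
    w = 1.0 - 4.0 * p / 3.0
    if w <= 0:
        raise ValueError("the correction is undefined for p >= 0.75")
    d = 0.75 * a * (w ** (-1.0 / a) - 1.0)
    var = p * (1.0 - p) / L * w ** (-2.0 * (1.0 / a + 1.0))
    return d, var


for a in (0.5, 1.0, 5.0):
    d, var = jukes_cantor_gamma(0.3, a, 100)
    print(f"a = {a}: d = {d:.4f}, SE = {math.sqrt(var):.4f}")
```

Smaller values of a mean stronger rate variation among sites and therefore a larger corrected distance; as a grows large, the result approaches the ordinary Jukes-Cantor distance.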

Kimura gamma distance

Kimura’s two-parameter gamma model corrects for multiple hits, taking into account transitional and transversional
substitution rates and differences in substitution rates among sites. Evolutionary rates among sites are modeled using
the Gamma distribution, and you will need to provide a gamma parameter for computing this distance.

The Kimura 2-parameter model

MEGA provides facilities for computing the following quantities:
Quantity                         Description
d: Transitions + Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

The formulas for computing these quantities are as follows:


Distances

where P and Q are the respective total frequencies of transition type pairs and transversion type pairs, a is the gamma
parameter, and

Variances

where

See also Nei and Kumar (2000), page 44 and estimating gamma parameter.
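
The formulas are not reproduced in this text. The sketch below assumes the standard published form of the Kimura 2-parameter gamma distance, d = (a/2)[(1 - 2P - Q)^(-1/a) + (1/2)(1 - 2Q)^(-1/a) - 3/2]. The variance is obtained here by the delta method, with c1 and c3 the partial derivatives of d with respect to P and Q. The function name is hypothetical.

```python
import math


def kimura_2p_gamma(P, Q, a, L):
    """Kimura 2-parameter distance with gamma-distributed rates among sites.

    P, Q: proportions of sites with transitional / transversional differences
    a:    gamma parameter (shape)
    L:    number of sites compared
    """
    w1 = 1.0 - 2.0 * P - Q
    w2 = 1.0 - 2.0 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance is undefined for these P and Q")
    d = 0.5 * a * (w1 ** (-1.0 / a) + 0.5 * w2 ** (-1.0 / a) - 1.5)
    c1 = w1 ** (-(1.0 / a + 1.0))      # derivative of d with respect to P
    c2 = w2 ** (-(1.0 / a + 1.0))
    c3 = 0.5 * (c1 + c2)               # derivative of d with respect to Q
    var = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / L
    return d, var


d, var = kimura_2p_gamma(P=0.10, Q=0.05, a=0.5, L=500)
print(f"d = {d:.4f}, SE = {math.sqrt(var):.4f}")
```

As with the Jukes-Cantor gamma distance, a very large value of a recovers the distance computed under uniform rates among sites.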

Tajima Nei distance (Gamma rates)

In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates between transitional and transversional substitutions. The rate
variation among sites is modeled using the gamma distribution, and you will need to provide a gamma parameter (a)
for computing this distance.

The Felsenstein-Tajima-Nei model

MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
The formulas for computing these quantities are as follows:
Distance

where p is the proportion of sites with different nucleotides, a is the gamma parameter, and

where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance

Tamura-Nei gamma distance

The Tamura-Nei (1993) distance with the gamma model corrects for multiple hits, taking into account the different
rates of substitution between nucleotides and the inequality of nucleotide frequencies. In this distance, evolutionary
rates among sites are modeled using the gamma distribution. You will need to provide a gamma parameter for
computing this distance.

The Tamura-Nei model

MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

The formulas for computing these quantities are as follows:


Distances

where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gA, gC, gG, gT, are the respective frequencies of A, C, G
and T, gR = gA + gG, gY = gT + gC, a is the gamma parameter, and

Variances

where

See also Nei and Kumar (2000), page 45 and estimating gamma parameter.

Tamura 3-parameter (Gamma)

Tamura’s 3-parameter model (1992) corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias. Evolutionary rates among sites are modeled using the
gamma distribution, and you will need to provide a gamma parameter for computing this distance.

The Tamura 3-parameter model

MEGA provides facilities for computing the following quantities:


Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

The formulas for computing these quantities are as follows:


Distances

where P and Q are the proportion of sites with transitional and transversional differences, respectively, a is the gamma
parameter, and

Variances

where

Maximum Composite Likelihood (Gamma Rates)

The Tamura-Nei (1993) distance with the gamma model estimated by the composite likelihood method (Tamura et al.
2004) corrects for multiple hits, taking into account the different rates of substitution between nucleotides and the
inequality of nucleotide frequencies. In this distance, evolutionary rates among sites are modeled using the gamma
distribution. You will need to provide a gamma parameter for computing this distance. See related Tamura-Nei
gamma distance.

HETEROGENEOUS PATTERNS

Tajima Nei Distance (Heterogeneous patterns)

In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates among sites and between transitional and transversional substitutions.
When the nucleotide frequencies are different between the sequences, the modified formula (Tamura and Kumar
2002) relaxes the assumption of substitution pattern homogeneity.

The Felsenstein-Tajima-Nei model

MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.

Formulas for computing these quantities are as follows:


Distance

where p is the proportion of sites with different nucleotides and

where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance can be estimated by the bootstrap method.

Tamura 3 parameter (Heterogeneous patterns)

Tamura’s 3-parameter model (1992) corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias. It assumes an equality of substitution rates among
sites. When the G+C-contents are different between the sequences, the modified formula (Tamura and Kumar 2002)
relaxes the assumption of substitution pattern homogeneity.

The Tamura 3-parameter model

MEGA provides facilities for computing the following quantities:


Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

Formulas for computing these quantities are as follows:


Distances

where P and Q are the proportion of sites with transitional and transversional differences, respectively, and

The variances can be estimated by the bootstrap method.

Tamura-Nei distance (Heterogeneous Patterns)

The Tamura-Nei model (1993) corrects for multiple hits, taking into account the substitution rate differences between
nucleotides and the inequality of nucleotide frequencies. It distinguishes between transitional substitution rates
between purines and transitional substitution rates between pyrimidines. It assumes an equality of substitution rates
among sites (see related gamma model). When nucleotide frequencies are different between the sequences, the
modified formula (Tamura and Kumar 2002) relaxes the assumption of substitution pattern homogeneity.

The Tamura-Nei model

MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

Formulas for computing these quantities are as follows:


Distances

where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gXA, gXC, gXG, gXT, are the respective frequencies of A, C,
G and T of sequence X, gXR = gXA + gXG and gXY = gXT + gXC, gA, gC, gG, gT, gR, and gY are the average frequencies of the
pair of sequences, and

The variances can be estimated by the bootstrap method.

Maximum Composite Likelihood (Heterogeneous Patterns)

The Tamura-Nei distance (1993) estimated by the composite likelihood method (Tamura et al. 2004) corrects for
multiple hits, taking into account the substitution rate differences between nucleotides and the inequality of nucleotide
frequencies. When the nucleotide frequencies between the sequences are different, the expected proportions of
observed differences (P1, P2, and Q) in the computation of the composite likelihood can be obtained by the modified
formulas according to Tamura and Kumar (2002) to relax the assumption of the substitution pattern homogeneity. See
related Tamura-Nei distance (Heterogeneous Patterns).

GAMMA RATES

Equal Input Model (Heterogeneous Patterns)

In real data, amino acid frequencies usually vary among different kinds of amino acids. In this case, a correction based
on the equal input model gives a better estimate of the number of amino acid substitutions than does the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites. When the amino acid
frequencies are different between the sequences, the modified formula (Tamura and Kumar 2002) reduces the
estimation bias.

MEGA provides facilities for computing the following quantities:


Quantity                      Description
d: distance                   Number of amino acid substitutions per site.
L: No of valid common sites   Number of sites compared.

Formulas used are:


Distance

where p is the proportion of different amino acid sites, gXi is the frequency of amino acid i for sequence X, gi is the
average frequency for the pair of the sequences, and

The variance of d can be estimated by the bootstrap method.

Tajima Nei Distance (Gamma Rates and Heterogeneous patterns)

In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates between transitional and transversional substitutions. The rate
variation among sites is modeled using the gamma distribution, and you will need to provide a gamma parameter (a)
for computing this distance. When the nucleotide
frequencies are different between the sequences, the modified formula (Tamura and Kumar 2002) relaxes the
assumption of substitution pattern homogeneity.

The Felsenstein-Tajima-Nei model

MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.

The formulas for computing these quantities are as follows:


Distance

where p is the proportion of sites with different nucleotides, a is the gamma parameter, and

where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance can be estimated by the bootstrap method.

Tamura-Nei distance (Gamma rates and Heterogeneous patterns)

The Tamura-Nei (1993) distance with the gamma model corrects for multiple hits, taking into account the
substitution rate differences between nucleotides and the inequality of nucleotide frequencies. In this distance,
evolutionary rates among sites are modeled using the gamma distribution. You will need to provide a gamma
parameter for computing this distance. When the nucleotide frequencies between the sequences are different, the
modified formula (Tamura and Kumar 2002) relaxes the assumption of the substitution pattern homogeneity.

The Tamura-Nei model

MEGA provides facilities for computing the following quantities for this method:

Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.

The formulas for computing these quantities are as follows:


Distances

where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gXA, gXC, gXG, gXT, are the respective frequencies of A, C,
G and T of sequence X, gXR = gXA + gXG and gXY = gXT + gXC, gA, gC, gG, gT, gR, and gY are the average frequencies of the
pair of sequences, a is the gamma parameter and

The variances can be estimated by the bootstrap method.

Tamura 3 parameter (Gamma rates and Heterogeneous patterns)

Tamura’s (1992) 3-parameter model corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias. Evolutionary rates among sites are modeled using the
gamma distribution, and you will need to provide a gamma parameter for computing this distance. When the G+C
contents of the sequences are different, the modified formula (Tamura and Kumar 2002) relaxes the assumption
of substitution pattern homogeneity.

The Tamura 3-parameter model

MEGA provides facilities for computing the following quantities:


Quantity                          Description
d: Transitions & Transversions    Number of nucleotide substitutions per site.
s: Transitions only               Number of transitional substitutions per site.
v: Transversions only             Number of transversional substitutions per site.
R = s/v                           Transition/transversion ratio.
L: No. of valid common sites      Number of sites compared.

Formulas for computing these quantities are as follows:


Distances

where P and Q are the proportions of sites with transitional and transversional differences, respectively, a is the gamma
parameter, and

The variances can be estimated by the bootstrap method.

Maximum Composite Likelihood (Gamma Rates and Heterogeneous Patterns)

The Tamura-Nei (1993) distance estimated by the composite likelihood method (Tamura et al. 2004) with the gamma
model corrects for multiple hits, taking into account the differences in substitution rate between nucleotides and the
inequality of nucleotide frequencies. In this distance, evolutionary rates among sites are modeled using the gamma
distribution. You will need to provide a gamma parameter for computing this distance. When the nucleotide
frequencies between the sequences are different, the expected proportions of observed differences (P1, P2, and Q) in
the computation of the composite likelihood can be obtained by the modified formulas of Tamura and
Kumar (2002), which relax the assumption of substitution pattern homogeneity.

AMINO ACID SUBSTITUTION MODELS

No. of differences (Amino acids)

This distance is the number of sites at which two sequences being compared are different. If the sequences
contain alignment gaps or missing data and you are using the pairwise deletion option, you must realize that this count
does not normalize the number of differences based on the number of valid sites compared. Therefore, if you use this
distance, we recommend that you use the complete-deletion option.

MEGA computes the following quantities:


Quantity                        Description
d: distance                     Number of sites different.
L: No. of valid common sites    Number of sites compared.

The formulas used are:


Quantity    Formula    Variance
nd          None       nd(L - nd)/L

See also Nei and Kumar (2000), page 18.

p-distance (Amino acids)

This distance is the proportion (p) of amino acid sites at which the two sequences to be compared are different. It is
obtained by dividing the number of amino acid differences by the total number of sites compared. It does not make
any correction for multiple substitutions at the same site or differences in evolutionary rates among sites.

MEGA provides facilities to compute the following quantities:


Quantity                        Description
d: distance                     Proportion of amino acid sites different.
L: No. of valid common sites    Number of sites compared.

The formulas used are:


Quantity    Formula    Variance
p           nd/L       p(1 - p)/L

where nd is the number of amino acids that are different between two aligned sequences.

See also Nei and Kumar (2000), page 18.
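
The two quantities above can be sketched in a few lines of Python. This is only an illustration of the formulas in the tables, not MEGA's own code; the function names and the choice of "-" and "?" as gap and missing-data symbols are assumptions made for the example. Sites with a gap or missing data in either sequence are skipped, which corresponds to the pairwise-deletion treatment.

```python
GAP_CHARS = {"-", "?"}  # assumed symbols for alignment gaps and missing data


def count_differences(seq1, seq2):
    """Return (nd, L): number of differing sites and number of valid common sites."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    nd = 0
    valid = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # pairwise deletion: ignore this site for this pair only
        valid += 1
        if a != b:
            nd += 1
    return nd, valid


def p_distance(seq1, seq2):
    """Return (p, variance) with p = nd/L and V(p) = p(1 - p)/L."""
    nd, valid = count_differences(seq1, seq2)
    if valid == 0:
        raise ValueError("no valid common sites to compare")
    p = nd / valid
    return p, p * (1.0 - p) / valid


if __name__ == "__main__":
    print(count_differences("MKVLA-", "MKILAA"))
    print(p_distance("MKVLA-", "MKILAA"))
```

Because nd is not normalized by L, two pairs with the same nd but different numbers of valid sites are not directly comparable; p is.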

Equal Input Model (Amino acids)

In real data, frequencies usually vary among different kinds of amino acids. In this case, the correction based on the
equal input model gives a better estimate of the number of amino acid substitutions than the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites and the homogeneity of
substitution patterns between lineages.

MEGA provides facilities to compute the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

The formulas used are:


Distance

where p is the proportion of different amino acid sites, gi is the frequency of amino acid i, and

Variance

Poisson Correction (PC) distance

The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies
while correcting for multiple substitutions at the same site.

MEGA provides facilities to compute the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

Formulas used are:


Quantity    Formula        Variance
d           -ln(1 - p)     p/[(1 - p)L]

See also Nei and Kumar (2000), page 20.
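
As a small worked illustration of the Poisson correction, the sketch below computes d and its variance from a given proportion of differences p and number of sites L. The function name is hypothetical, and the variance is taken as p/[(1 - p)L].

```python
import math


def poisson_correction(p, n_sites):
    """Return (d, variance) for the Poisson correction distance.

    d = -ln(1 - p) and V(d) = p / [(1 - p) L].
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if n_sites <= 0:
        raise ValueError("number of sites must be positive")
    d = -math.log(1.0 - p)
    variance = p / ((1.0 - p) * n_sites)
    return d, variance


if __name__ == "__main__":
    d, var = poisson_correction(0.5, 100)
    print(d, var)
```

The distance grows without bound as p approaches 1, which is why the function rejects p = 1.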

Dayhoff and JTT Models

The PAM and JTT distances correct for multiple substitutions based on the model of amino acid substitution
described as substitution-rate matrices. The PAM distance uses the PAM 001 matrix (p. 348 in Dayhoff 1979) and
the JTT distance uses the JTT matrix (Jones et al. 1992). Using a substitution-rate matrix (Q), the matrix (F), which
consists of the observed proportions of amino acid pairs between a pair of sequences with their divergence time t, is
given by the following equation

where A denotes the diagonal matrix of the equilibrium amino acid frequencies for Q. From this equation, the
evolutionary distance d = 2tQ can be iteratively computed by a maximum-likelihood method. The eigenvalues for
the PAM and JTT matrices required in this computation were obtained from the program source code of PHYLIP
version 3.6 (Felsenstein et al. 1993-2001).

MEGA provides facilities for computing the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

The variance of d can be estimated by the bootstrap method.

GAMMA DISTANCES

Computing the Gamma Parameter (a)

In the computation of gamma distances, it is necessary to know the gamma parameter (a). This parameter may be
estimated from the dataset under consideration or you may use the value obtained from previous studies. For
estimating a, a substantial number of sequences is necessary; if the number of sequences used is small, the estimate
has a downward bias (Zhang and Gu 1998). The current release of MEGA does not contain any programs for
estimating a; however, we plan to make them available in the future. Therefore, you need to use another program for
estimating the a value. Some of the frequently used programs that include this facility are PAUP* (Swofford 1998)
for DNA sequences, PAML and PAMP programs for DNA and protein sequences (Yang 1999), and GAMMA
programs from Gu and Zhang (1997).

Dayhoff and JTT distances (Gamma rates)

The PAM and JTT distances correct for multiple substitutions based on a model of amino acid substitution
described as substitution-rate matrices. The PAM distance uses the PAM 001 matrix (p. 348 in Dayhoff 1979) and the
JTT distance uses the JTT matrix (Jones et al. 1992). The matrix (F) uses a substitution-rate matrix (Q) and the gamma
distribution with parameter a for the rate variation among sites. It consists of the observed proportions of amino
acid pairs with their divergence time t, given by the following equation

where A denotes the diagonal matrix of the equilibrium amino acid frequencies for Q. From this equation, the
evolutionary distance d = 2tQ can be computed iteratively by a maximum-likelihood method. The eigenvalues for
the PAM and JTT matrices required in this computation were obtained from the program source code of PHYLIP
version 3.6 (Felsenstein et al. 1993-2001).

MEGA provides facilities for computing the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

The variance of d can be estimated by the bootstrap method.

Gamma distance (Amino acids)


The Gamma distance improves upon the Poisson correction distance by accounting for the inequality of the substitution
rates among sites. For this purpose, you will need to provide the gamma shape parameter (a).

For estimating the Dayhoff distance, use a = 2.25 (see Nei and Kumar [2000], page 21 for details).
For computing Grishin’s distance, use a = 0.65 (see Nei and Kumar [2000], page 23 for details).

MEGA provides facilities to compute the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

Formulas used are:


Quantity    Formula                    Variance
d           a[(1 - p)^(-1/a) - 1]      p[(1 - p)^(-(1 + 2/a))]/L

See also Nei and Kumar (2000), page 23 and estimating gamma parameter.
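
The gamma distance and its variance can be evaluated directly from p, L, and the shape parameter a. The sketch below is illustrative only; the function name is an assumption, and the formulas are d = a[(1 - p)^(-1/a) - 1] and V(d) = p(1 - p)^(-(1 + 2/a))/L.

```python
def gamma_distance(p, n_sites, a):
    """Return (d, variance) for the amino acid gamma distance.

    p       : proportion of different amino acid sites (0 <= p < 1)
    n_sites : number of valid common sites compared (L)
    a       : gamma shape parameter (a > 0)
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if a <= 0.0 or n_sites <= 0:
        raise ValueError("a and the number of sites must be positive")
    d = a * ((1.0 - p) ** (-1.0 / a) - 1.0)
    variance = p * (1.0 - p) ** (-(1.0 + 2.0 / a)) / n_sites
    return d, variance


if __name__ == "__main__":
    # The values of a suggested in the text for the Dayhoff and Grishin distances.
    for shape in (2.25, 0.65):
        print(shape, gamma_distance(0.3, 200, shape))
```

As a grows large, rate variation among sites vanishes and d approaches the Poisson correction distance -ln(1 - p); smaller values of a give larger distances for the same p.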

HETEROGENEOUS PATTERNS

Equal Input Model (Heterogeneous Patterns)

In real data, amino acid frequencies usually vary among different kinds of amino acids. In this case, a correction based
on the equal input model gives a better estimate of the number of amino acid substitutions than does the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites. When the amino acid
frequencies are different between the sequences, the modified formula (Tamura and Kumar 2002) reduces the
estimation bias.

MEGA provides facilities for computing the following quantities:


Quantity                        Description
d: distance                     Number of amino acid substitutions per site.
L: No. of valid common sites    Number of sites compared.

Formulas used are:


Distance

where p is the proportion of different amino acid sites, gXi is the frequency of amino acid i for sequence X, gi is the
average frequency for the pair of the sequences, and

The variance of d can be estimated by the bootstrap method.

SYNONYMOUS AND NONSYNONYMOUS SUBSTITUTION MODELS

Nei-Gojobori Method

This method computes the numbers of synonymous and nonsynonymous substitutions and the numbers of potentially
synonymous and potentially nonsynonymous sites (Nei and Gojobori 1986). Based on these estimates, MEGA can be
asked to produce the following quantities:

Number of differences (Sd or Nd)


These are simple counts of the number of synonymous (Sd) and nonsynonymous (Nd) differences. To
compare these two numbers, you must use the p-distance because the number of potential synonymous sites
is much smaller than the number of nonsynonymous sites.

p-distance (pS or pN)


The count of the number of synonymous differences (Sd) is normalized using the possible number of
synonymous sites (S). A similar computation can be made for nonsynonymous differences.

Jukes-Cantor correction (dS or dN)

The p-distances computed above can be corrected to account for multiple substitutions at the same site.

Difference between synonymous and nonsynonymous distances


MEGA can compute differences between the synonymous and nonsynonymous distances. These statistics are
useful in conducting tests for selection.

Number of Sites (S or N)
The numbers of potential synonymous and nonsynonymous sites can be computed using this option. For
each pair of sequences, the average number of synonymous or nonsynonymous sites is reported.

The formulas for computing these quantities are:

Quantity    Formula                    Variance
pS          Sd/S                       V(pS) = pS(1 - pS)/S
pN          Nd/N                       V(pN) = pN(1 - pN)/N
dS          -(3/4)ln(1 - (4/3)pS)      V(dS) = pS(1 - pS)/[(1 - (4/3)pS)^2 S]
dN          -(3/4)ln(1 - (4/3)pN)      V(dN) = pN(1 - pN)/[(1 - (4/3)pN)^2 N]
Dp          pN - pS                    V(pN) + V(pS)
Dd          dN - dS                    V(dN) + V(dS)

See also Nei and Kumar (2000), page 52
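
Once the counts Sd, Nd and the numbers of sites S, N are known, the remaining quantities follow mechanically. The sketch below assumes those four values are already available (counting them from codons is not shown) and uses hypothetical function and key names; it applies pS = Sd/S, the Jukes-Cantor correction d = -(3/4)ln(1 - (4/3)p), and the variances listed for this method.

```python
import math


def jukes_cantor(p, n_sites):
    """Return (d, variance): d = -(3/4)ln(1 - (4/3)p), V(d) = p(1-p)/[(1 - (4/3)p)^2 n]."""
    w = 1.0 - 4.0 * p / 3.0
    if w <= 0.0:
        raise ValueError("p is too large for the Jukes-Cantor correction")
    d = -0.75 * math.log(w)
    variance = p * (1.0 - p) / (w * w * n_sites)
    return d, variance


def nei_gojobori_summary(sd, s_sites, nd, n_sites):
    """Compute p-distances, corrected distances, and their differences."""
    ps = sd / s_sites
    pn = nd / n_sites
    v_ps = ps * (1.0 - ps) / s_sites
    v_pn = pn * (1.0 - pn) / n_sites
    ds, v_ds = jukes_cantor(ps, s_sites)
    dn, v_dn = jukes_cantor(pn, n_sites)
    return {
        "pS": ps, "V(pS)": v_ps,
        "pN": pn, "V(pN)": v_pn,
        "dS": ds, "V(dS)": v_ds,
        "dN": dn, "V(dN)": v_dn,
        "Dp": pn - ps, "V(Dp)": v_pn + v_ps,
        "Dd": dn - ds, "V(Dd)": v_dn + v_ds,
    }


if __name__ == "__main__":
    for name, value in nei_gojobori_summary(15, 60, 10, 200).items():
        print(name, round(value, 6))
```

A difference such as Dd divided by the square root of its variance gives the kind of statistic used in tests for selection.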

Modified Nei-Gojobori Method

The modified Nei-Gojobori distance differs from the original Nei-Gojobori formulation in one way: transitional
and transversional substitutions are no longer assumed to occur with the same frequency. Thus the user is requested to
provide the Transition/Transversion ratio (R). When R = 0.5, this method becomes identical to the Nei-Gojobori
method. When R > 0.5, transitions, which are more often synonymous than transversions, receive more weight, so the
number of synonymous sites is larger than that estimated using the Nei-Gojobori method and, consequently, the number
of nonsynonymous sites is smaller than that estimated with the original Nei-Gojobori (Nei and Gojobori 1986) approach.

Number of differences (Sd or Nd)


These are counts of the numbers of synonymous (Sd) and nonsynonymous (Nd) differences. To compare
these two numbers you must use the p-distance because the number of potential synonymous sites is much
smaller than the number of nonsynonymous sites.

p-distance (pS or pN)


The count of the number of synonymous differences (Sd) is normalized using the number of potential
synonymous sites (S). A similar computation can be made for nonsynonymous differences.

Jukes-Cantor correction (dS or dN)
The p-distances computed above can be corrected to account for multiple substitutions at the same site.

Difference between synonymous and nonsynonymous distances


MEGA can compute differences between synonymous and nonsynonymous distances. These statistics are
useful when conducting tests for selection.

Number of Sites (S or N)
Numbers of potentially synonymous and nonsynonymous sites can be computed using this option. For each
pair of sequences, the average number of synonymous or nonsynonymous sites is reported.

The formulas for computing these quantities are:

Quantity    Formula                    Variance
pS          Sd/SR                      V(pS) = pS(1 - pS)/SR
pN          Nd/NR                      V(pN) = pN(1 - pN)/NR
dS          -(3/4)ln(1 - (4/3)pS)      V(dS) = pS(1 - pS)/[(1 - (4/3)pS)^2 SR]
dN          -(3/4)ln(1 - (4/3)pN)      V(dN) = pN(1 - pN)/[(1 - (4/3)pN)^2 NR]
D           pN - pS                    V(pN) + V(pS)
D           dN - dS                    V(dN) + V(dS)

See also Nei and Kumar (2000), page 52.

Li-Wu-Luo Method

In this method (Li et al. 1985), each site in a codon is allocated to 0-fold, 2-fold or 4-fold degenerate categories. For
computing distances, all 0-fold and two-thirds of the 2-fold sites are considered nonsynonymous, whereas one-third of
the 2-fold and all of the 4-fold sites are considered synonymous. The observed transitional and transversional
differences between codons are then partitioned into those occurring at 0-fold, 2-fold and 4-fold degenerate
sites. Based on this information, the following quantities can be estimated.

Synonymous distance
This is the number of synonymous substitutions per synonymous site.

Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.

Substitutions at the 4-fold degenerate sites


This is the number of substitutions per 4-fold degenerate site; it is useful for measuring the rate of neutral
evolution.

Substitutions at the 0-fold degenerate sites


This is the number of substitutions per 0-fold degenerate site; it is useful for measuring the rate of amino
acid sequence evolution.

Number of 4-fold degenerate sites


This is the estimate of the number of 4-fold degenerate sites, computed by averaging the number of 4-fold
degenerate sites in the two sequences compared.

Number of 0-fold degenerate sites


This is the estimate of the number of 0-fold degenerate sites, computed by averaging the number of 0-fold
degenerate sites in the two sequences compared.

Difference between synonymous and nonsynonymous distances


This computes the differences between the synonymous and nonsynonymous distances. These statistics are
useful for conducting tests of selection.

The formulas for computing these quantities are:

Quantity    Formula                                 Variance
dS          3[L2 A2 + L4(A4 + B4)]/(L2 + 3L4)       9[L2^2 V(A2) + L4^2 V(A4 + B4)]/(L2 + 3L4)^2
dN          3[L2 B2 + L0(A0 + B0)]/(2L2 + 3L0)      9[L2^2 V(B2) + L0^2 V(A0 + B0)]/(2L2 + 3L0)^2
d4          A4 + B4                                 [a4^2 P4 + k4^2 Q4 - (a4 P4 + k4 Q4)^2]/L4
d0          A0 + B0                                 [a0^2 P0 + k0^2 Q0 - (a0 P0 + k0 Q0)^2]/L0
D           dN - dS                                 V(dN) + V(dS)

Here,
L0, L2, and L4 are the numbers of 0-fold, 2-fold, and 4-fold degenerate sites, respectively.
Ai = (1/2)ln(ai) - (1/4)ln(bi), and
Bi = (1/2)ln(bi), where
ai = 1/(1 - 2Pi - Qi), bi = 1/(1 - 2Qi), ci = (ai - bi)/2, ki = (ai + bi)/2

Pi and Qi are the proportions of i-fold degenerate sites that show transitional and transversional differences,
respectively.

V(Ai) = [ai^2 Pi + ci^2 Qi - (ai Pi + ci Qi)^2]/Li

V(Bi) = bi^2 Qi(1 - Qi)/Li

See also Nei and Kumar (2000), page 62.
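
A minimal sketch of the point estimates for this method is given below. It assumes the proportions Pi and Qi and the site counts Li have already been tabulated for the 0-fold, 2-fold, and 4-fold classes; the function names and the dictionary layout are choices made for this example only, and the variances are omitted for brevity. It uses Ai = (1/2)ln(ai) - (1/4)ln(bi) and Bi = (1/2)ln(bi), with ai = 1/(1 - 2Pi - Qi) and bi = 1/(1 - 2Qi).

```python
import math


def transition_transversion_components(p, q):
    """Return (A, B) for one degeneracy class from its proportions P and Q."""
    w1 = 1.0 - 2.0 * p - q
    w2 = 1.0 - 2.0 * q
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("P and Q are too large for the logarithmic correction")
    a = 1.0 / w1
    b = 1.0 / w2
    big_a = 0.5 * math.log(a) - 0.25 * math.log(b)  # transitional component
    big_b = 0.5 * math.log(b)                       # transversional component
    return big_a, big_b


def li_wu_luo(sites, p, q):
    """sites, p, q are dicts keyed by degeneracy class (0, 2, 4)."""
    comp_a, comp_b = {}, {}
    for i in (0, 2, 4):
        comp_a[i], comp_b[i] = transition_transversion_components(p[i], q[i])
    l0, l2, l4 = sites[0], sites[2], sites[4]
    d4 = comp_a[4] + comp_b[4]
    d0 = comp_a[0] + comp_b[0]
    ds = 3.0 * (l2 * comp_a[2] + l4 * d4) / (l2 + 3.0 * l4)
    dn = 3.0 * (l2 * comp_b[2] + l0 * d0) / (2.0 * l2 + 3.0 * l0)
    return {"dS": ds, "dN": dn, "d4": d4, "d0": d0, "D": dn - ds}


if __name__ == "__main__":
    sites = {0: 100, 2: 50, 4: 30}
    p = {0: 0.02, 2: 0.10, 4: 0.15}
    q = {0: 0.01, 2: 0.03, 4: 0.10}
    print(li_wu_luo(sites, p, q))
```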

Pamilo-Bianchi-Li Method

This method (Pamilo and Bianchi 1993; Li 1993) is a modification of Li, Wu and Luo's method. The only difference
concerns the allocation of 2-fold sites to synonymous and nonsynonymous categories. Rather than assuming an
equal transition and transversion rate, the rate is inferred from the observed number of transitions and transversions at
the 4-fold degenerate sites. Based on this information, the following quantities can be estimated:

Synonymous distance
This is the number of synonymous substitutions per synonymous site.

Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.

Substitutions at the 4-fold degenerate sites (d4)


This is the number of substitutions per 4-fold degenerate site; it is useful for measuring the rate of neutral
evolution.

Substitutions at the 0-fold degenerate sites (d0)


This is the number of substitutions per 0-fold degenerate site; it is useful for measuring the rate of amino
acid sequence evolution.

Number of 4-fold degenerate sites (L4)


This is the estimate of the number of 4-fold degenerate sites, computed by averaging the number of 4-fold
degenerate sites in the two sequences compared.

Number of 0-fold degenerate sites (L0)


This is the estimate of the number of 0-fold degenerate sites, computed by averaging the number of 0-fold
degenerate sites in the two sequences compared.

Difference between synonymous and nonsynonymous distances (D)


This computes the differences between the synonymous and nonsynonymous distances. These statistics are
useful for conducting tests of selection.

The formulas for computing these quantities are:

Quantity    Formula                           Variance
dS          B4 + (L2 A2 + L4 A4)/(L2 + L4)    V(B4) + [L2^2 V(A2) + L4^2 V(A4)]/(L2 + L4)^2 - b4 Q4[2a4 P4 - c4(1 - Q4)]/(L2 + L4)
dN          A0 + (L0 B0 + L2 B2)/(L0 + L2)    V(A0) + [L0^2 V(B0) + L2^2 V(B2)]/(L0 + L2)^2 - b0 Q0[2a0 P0 - c0(1 - Q0)]/(L0 + L2)
d4          A4 + B4                           [a4^2 P4 + k4^2 Q4 - (a4 P4 + k4 Q4)^2]/L4
d0          A0 + B0                           [a0^2 P0 + k0^2 Q0 - (a0 P0 + k0 Q0)^2]/L0
D           dN - dS                           V(dN) + V(dS) - 2cov(dS, dN)

Here,
L0, L2, and L4 are the numbers of 0-fold, 2-fold, and 4-fold degenerate sites, respectively.
Ai = (1/2)ln(ai) - (1/4)ln(bi), and
Bi = (1/2)ln(bi), where
ai = 1/(1 - 2Pi - Qi), bi = 1/(1 - 2Qi), ci = (ai - bi)/2, ki = (ai + bi)/2

Pi and Qi are the proportions of i-fold degenerate sites that show transitional and transversional differences,
respectively.

V(Ai) = [ai^2 Pi + ci^2 Qi - (ai Pi + ci Qi)^2]/Li

V(Bi) = bi^2 Qi(1 - Qi)/Li

See also Nei and Kumar (2000), page 64.
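
The sketch below illustrates the point estimates for this method under one stated assumption: the synonymous distance is taken to be the expression built from the 2-fold and 4-fold classes, B4 + (L2 A2 + L4 A4)/(L2 + L4), and the nonsynonymous distance the expression built from the 0-fold and 2-fold classes, A0 + (L0 B0 + L2 B2)/(L0 + L2), which is how the method is usually stated. Function names and the dictionary layout are hypothetical, and variances and the covariance term are not computed.

```python
import math


def components(p, q):
    """Return (A, B) with A = (1/2)ln(a) - (1/4)ln(b) and B = (1/2)ln(b)."""
    w1 = 1.0 - 2.0 * p - q
    w2 = 1.0 - 2.0 * q
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("P and Q are too large for the logarithmic correction")
    big_a = 0.5 * math.log(1.0 / w1) - 0.25 * math.log(1.0 / w2)
    big_b = 0.5 * math.log(1.0 / w2)
    return big_a, big_b


def pamilo_bianchi_li(sites, p, q):
    """sites, p, q are dicts keyed by degeneracy class (0, 2, 4)."""
    a0, b0 = components(p[0], q[0])
    a2, b2 = components(p[2], q[2])
    a4, b4 = components(p[4], q[4])
    l0, l2, l4 = sites[0], sites[2], sites[4]
    # Assumption: transitions at 2-fold and 4-fold sites are pooled for dS,
    # transversions at 0-fold and 2-fold sites are pooled for dN.
    ds = b4 + (l2 * a2 + l4 * a4) / (l2 + l4)
    dn = a0 + (l0 * b0 + l2 * b2) / (l0 + l2)
    return {"dS": ds, "dN": dn, "d4": a4 + b4, "d0": a0 + b0, "D": dn - ds}


if __name__ == "__main__":
    sites = {0: 100, 2: 50, 4: 30}
    p = {0: 0.02, 2: 0.10, 4: 0.15}
    q = {0: 0.01, 2: 0.03, 4: 0.10}
    print(pamilo_bianchi_li(sites, p, q))
```

Unlike the Li-Wu-Luo weighting of one-third and two-thirds for 2-fold sites, the transition and transversion components here are pooled across classes, so the transition bias observed in the data determines the result.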

Kumar Method

This method is a modification of the Pamilo-Bianchi-Li and Comeron (1995) methods and is able to handle some
problematic degeneracy class assignments (see a detailed description below). It computes the following quantities:

Synonymous distance
This is the number of synonymous substitutions per synonymous site.

Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.

Substitutions at the 4-fold degenerate sites


This is the number of substitutions per 4-fold degenerate site. It is useful for measuring the rate of neutral
evolution.

Substitutions at the 0-fold degenerate sites


This is the number of substitutions per 0-fold degenerate site. It is useful for measuring the rate of amino
acid sequence evolution.

Number of 4-fold degenerate sites


This is the estimate of the number of 4-fold degenerate sites, computed by averaging the number of 4-fold
degenerate sites in the two sequences compared.

Number of 0-fold degenerate sites


This is the estimate of the number of 0-fold degenerate sites, computed by averaging the number of 0-fold
degenerate sites in the two sequences compared.

Difference between synonymous and nonsynonymous distances


This computes the differences between the synonymous and nonsynonymous distances. These statistics are
useful for conducting tests of selection.
Kumar’s modification of the PBL method:
The treatment of arginine and isoleucine codons in the Li-Wu-Luo and the Pamilo-Bianchi-Li methods is arbitrary,
which sometimes creates a problem because the arginine codons occur quite frequently. Comeron (1995) addressed
this problem by dividing the 2-fold degenerate sites into two groups: 2S-fold and 2V-fold. The 2S-fold refers to sites
in which the transitional change is synonymous and the two transversional changes are nonsynonymous, whereas the
2V-fold represents sites in which the transitional change is nonsynonymous and the transversional changes are
synonymous. Although these definitions help in correcting some of the inaccurate classifications of synonymous and
nonsynonymous sites (e.g., methionine codons), they do not solve the problem completely. For example, consider
mutations in the first nucleotide position of the arginine codon: CGG produces TGG (Trp), AGG (Arg), or GGG
(Gly). The transitional change (C to T) results in a nonsynonymous change. Of the two transversional substitutions,
one (C to A) results in a synonymous change, while the other (C to G) results in a nonsynonymous change. Therefore,
this nucleotide site is neither a 2S-fold nor a 2V-fold site. Thus, the first position of three arginine codons (CGU,
CGC, and CGA) and the third position of two isoleucine codons (ATT and ATC) cannot be assigned to any of the
Comeron (1995) categories. For this reason, Comeron (personal communication) used a more complicated
classification of codons when he wrote his computer program. For example, the first position of arginine codon CGG
was assigned to a 2V-fold site with a probability of one-third and to a 0-fold site with a probability of two-
thirds. Similar assignments are used by W.-H. Li (personal communication) in his computer program.
Since the nucleotide site assignments discussed above are quite arbitrary and may not apply to all known genetic
code tables, Kumar developed another method that uses the PBL method for any genetic code table. In this version,
nucleotide sites are first classified into 0-fold, 2-fold, and 4-fold degenerate sites. The 2-fold degenerate sites are
further subdivided into simple 2-fold and complex 2-fold degenerate sites. Simple 2-fold sites are those at which
the transitional change results in a synonymous substitution and the two transversional changes result in
nonsynonymous substitutions. All other 2-fold sites, including those for the three isoleucine codons, belong to the
complex 2-fold site category. If we use this definition, all nucleotide sites can be classified into the five groups
shown in the following table.

Table.
Degeneracy ->       0-fold    Simple 2-fold    Complex 2-fold       4-fold
                                               Syn       Nonsyn
No. of sites ->     L0        L2S              L2C                  L4
Transition (s)      s0        s2               s2S       s2N        s4
Transversion (v)    v0        v2               v2S       v2N        v4

Here, L0, L2S, L2C, and L4 are the numbers of 0-fold, simple 2-fold, complex 2-fold, and 4-fold degenerate sites,
respectively.

Once this table is filled using the observed counts for a given pair of sequences, we compute the proportions of
transitional (Pi) and transversional (Qi) differences for the i-fold degenerate site in the following way:

P0 = (s0 + s2N)/(L0 + L2C)       Q0 = v0/L0
P2 = (s2 + s2S)/(L2S + L2C)      Q2 = (v2 + v2N)/(L2S + L2C)
P4 = s4/L4                       Q4 = (v4 + v2S)/(L4 + L2C)

From these quantities, we compute the Ai and Bi as in the PBL method. Then using L2 = L2C + L2S, we apply the
formulas for the PBL method.

See also Nei and Kumar (2000), page 64.
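
The step from the table of counts to the proportions Pi and Qi can be sketched as follows. The key names mirror the symbols in the table, and the function name is hypothetical. One assumption is made explicit: P2 is computed from s2 and s2S, the transitions observed at simple 2-fold sites plus the synonymous transitions at complex 2-fold sites.

```python
def kumar_proportions(counts, sites):
    """Return P and Q for the 0-fold, 2-fold, and 4-fold classes.

    counts: dict with keys s0, s2, s2S, s2N, s4, v0, v2, v2S, v2N, v4
    sites : dict with keys L0, L2S, L2C, L4
    """
    l0, l2s, l2c, l4 = sites["L0"], sites["L2S"], sites["L2C"], sites["L4"]
    p = {
        0: (counts["s0"] + counts["s2N"]) / (l0 + l2c),
        # Assumption: s2 (simple 2-fold transitions) enters P2.
        2: (counts["s2"] + counts["s2S"]) / (l2s + l2c),
        4: counts["s4"] / l4,
    }
    q = {
        0: counts["v0"] / l0,
        2: (counts["v2"] + counts["v2N"]) / (l2s + l2c),
        4: (counts["v4"] + counts["v2S"]) / (l4 + l2c),
    }
    return p, q


if __name__ == "__main__":
    counts = {"s0": 2, "s2": 4, "s2S": 2, "s2N": 1, "s4": 5,
              "v0": 1, "v2": 1, "v2S": 1, "v2N": 1, "v4": 3}
    sites = {"L0": 100, "L2S": 40, "L2C": 10, "L4": 50}
    print(kumar_proportions(counts, sites))
```

The resulting P and Q, together with L2 = L2C + L2S, are the inputs to the Ai and Bi expressions of the PBL method.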

SPECIFYING DISTANCE ESTIMATION OPTIONS

Analysis Preferences (Distance Computation)

In this dialog box you can select and view the desired options in the Options Summary. Options are organized in
logical sections. A yellow row indicates that you have a choice regarding the attribute in that row. The three primary
sets of options available in this dialog box are:
Analysis
Variance Estimation Method
Use this to specify whether to compute Distances only or Distances and Standard Errors using the selected
estimation method. If you select the latter, then you are given a choice as to how to compute it in the No. of
Bootstrap Replications box.
When you compute average distance or diversity, only the bootstrap method is available for computing
standard errors.
Substitution Model
In this set of options, you choose the various attributes of the substitution models.
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection
in this row affects the available models in the model row.
Model
Here you select a stochastic model for estimating evolutionary distance by clicking on the row then selecting a
model for the current Substitutions Type.
Substitutions to Include
Depending on the distance model or method selected, the evolutionary distance can be separated into two or more
components. By clicking on the row, you will be provided with a list of components relevant to the chosen
model.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Pattern among Lineages
This option becomes available if the selected model has formulas that allow the relaxation of the assumption
of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the selected distance model has formulas that allow rate variation among
sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.
Data Subset to Use
These are options for handling gaps or missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion, for which you specify the site coverage as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions and you have selected a
nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
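
The difference between the gap-handling options can be illustrated with a short sketch. It is not MEGA's implementation: the function names, the use of "-" and "?" as gap and missing-data symbols, and the reading of site coverage as the fraction of sequences with data at a site are assumptions for the example. With a cutoff of 1.0 the selection corresponds to complete deletion; a lower cutoff corresponds to partial deletion. Pairwise deletion keeps every column and instead skips sites pair by pair during the distance computation.

```python
GAP_CHARS = {"-", "?"}  # assumed symbols for alignment gaps and missing data


def site_coverage(alignment):
    """Return, for each column, the fraction of sequences that have data."""
    if not alignment:
        raise ValueError("alignment is empty")
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    coverage = []
    for col in range(length):
        present = sum(1 for seq in alignment if seq[col] not in GAP_CHARS)
        coverage.append(present / n_seqs)
    return coverage


def select_sites(alignment, cutoff=1.0):
    """Return indices of columns whose coverage is at least the cutoff."""
    return [i for i, c in enumerate(site_coverage(alignment)) if c >= cutoff]


if __name__ == "__main__":
    aln = ["AC-T", "A-GT", "ACGT"]
    print(select_sites(aln, 1.0))   # complete deletion
    print(select_sites(aln, 0.6))   # partial deletion with a 60% cutoff
```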

Distance Model Options


With this option, you can choose the general attributes of the substitution models for DNA and protein sequence
evolution.

Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection in this
row affects the available models in the model row.
Model
You can select a stochastic model for estimating evolutionary distances by clicking on the row then selecting a model
for the current Substitutions Type.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion ratio
(R).
Pattern among Lineages
This option becomes available if the distance model you have selected has formulas that allow the relaxation of the
assumption of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the distance model you have selected has formulas that allow rate variation among
sites. If you choose gamma distributed rates, then the Gamma parameter option becomes visible.

Bootstrap method to compute standard error of distance estimates

When you choose the bootstrap method for estimating the standard error, you must specify the number of replicates
and the seed for the pseudorandom number generator. The desired quantity is estimated in each bootstrap replicate,
and the standard deviation of these replicate estimates is computed (see Nei and Kumar [2000], page 25 for details).
It is possible that in some bootstrap replicates the quantity you desire is not calculable for statistical or technical
reasons. In these cases, MEGA will discard the results of those bootstrap replicates, and its final estimate will be
based on the results of all valid replicates. This means that the number of bootstrap replicates used can be smaller
than the number specified by the user. However, if the number of valid bootstrap replicates is < 25, then MEGA will
report that the standard error cannot be computed (an “n/c” will appear in the result window).
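
The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not MEGA's code: the function names are hypothetical, sites are resampled with replacement using Python's own pseudorandom generator, the sample standard deviation (divisor B - 1) is used, and a return value of None stands in for the “n/c” result.

```python
import random
import statistics

MIN_VALID_REPLICATES = 25  # threshold stated in the text


def bootstrap_standard_error(seq1, seq2, distance_fn, n_replicates, seed):
    """Estimate the standard error of distance_fn(seq1, seq2) by bootstrapping sites."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    rng = random.Random(seed)
    n_sites = len(seq1)
    estimates = []
    for _ in range(n_replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        rep1 = "".join(seq1[c] for c in cols)
        rep2 = "".join(seq2[c] for c in cols)
        try:
            estimates.append(distance_fn(rep1, rep2))
        except (ValueError, ZeroDivisionError):
            continue  # quantity not calculable in this replicate: discard it
    if len(estimates) < MIN_VALID_REPLICATES:
        return None  # corresponds to "n/c"
    return statistics.stdev(estimates)


def simple_p_distance(a, b):
    """Proportion of differing sites; a stand-in for any distance measure."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    s1 = "ACGTACGTACGTACGTACGT"
    s2 = "ACGTACGAACGTTCGTACGA"
    print(bootstrap_standard_error(s1, s2, simple_p_distance, 500, seed=42))
```

Fixing the seed makes the result reproducible, which is the reason the seed is requested alongside the number of replicates.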

CONSTRUCTING PHYLOGENETIC TREES

Phylogenetic Inference
Reconstruction of the evolutionary history of genes and species is currently one of the most important subjects in
molecular evolution. If reliable phylogenies are produced, they will shed light on the sequence of evolutionary events
that generated the present day diversity of genes and species and help us to understand the mechanisms of evolution as
well as the history of organisms.
Phylogenetic relationships of genes or organisms usually are presented in a treelike form with a root, which is called
a rooted tree. It also is possible to draw a tree without a root, which is called an unrooted tree. The branching pattern
of a tree is called a topology.
There are numerous methods for constructing phylogenetic trees from molecular data (Nei and Kumar 2000). They
can be classified into Distance methods, Parsimony methods, and Likelihood methods. These methods are explained
in Swofford et al. (1996), Li (1997), Page and Holmes (1998), and Nei and Kumar (2000).

NJ / UPGMA METHODS

Analysis Preferences (NJ/UPGMA)


In this dialog box, you can view and select desired options in the Options Summary. Options are organized in
logical sections. A yellow row indicates that you have a choice for that attribute. The three primary sets of options
available in this dialog box are:
Phylogeny Test and Options

To assess the reliability of a phylogenetic tree, MEGA provides the Bootstrap test. This test uses the
bootstrap re-sampling strategy, so you need to enter the number of replicates. Only the tests applicable to
the given data set and phylogeny inference method are enabled. Neighbor joining has an additional
test, Interior Branch, which requires the same input as the bootstrap test.
Substitution Model
In this set of options, you can choose various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection
in this row affects the available models in the model row.
Model
Here you select a stochastic model for estimating evolutionary distance by clicking on the row then selecting a
model for the current Substitutions Type.
Substitutions to Include
Depending on the distance model or method selected, the evolutionary distance can be separated into two or more
components. By clicking on the row, you will be provided with a list of components relevant to the chosen
model.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for
the Transition/Transversion ratio (R).
Pattern among Lineages
This option becomes available if the selected model has formulas that allow the relaxation of the assumption
of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the selected distance model has formulas that allow rate variation among
sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.
Data Subset to Use
These are options for handling gaps and missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions and you have selected a
nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
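
As a rough illustration of the Substitutions to Include and Pairwise-deletion options described above, the following
sketch counts transitions and transversions separately for one pair of nucleotide sequences and skips any site where
either sequence has a gap or missing data. The function name and the rule that anything other than A, C, G, or T is
skipped are assumptions made for this example; this is not MEGA's own code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def pairwise_components(seq1, seq2):
    """Return (transitions, transversions) as proportions of compared sites.

    Sites where either sequence has a gap, a missing-data symbol, or any
    other non-ACGT character are skipped for this pair only, which mimics
    the idea behind pairwise deletion.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in VALID or b not in VALID:
            continue  # pairwise deletion: ignore this site for this pair
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites for this pair")
    return transitions / compared, transversions / compared


print(pairwise_components("ACGT-A", "GCTTAA"))
```

Here five sites are compared (the gapped site is dropped), with one transition and one transversion.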

MINIMUM EVOLUTION METHOD

Minimum Evolution
In the ME method, distance measures that correct for multiple hits at the same sites are used, and a topology showing
the smallest value of the sum of all branches (S) is chosen as an estimate of the correct tree. However, the
construction of a minimum evolution tree is time-consuming because, in principle, the S values for all topologies must
be evaluated. The number of possible topologies (unrooted trees) rapidly increases with the number of taxa so it
becomes very difficult to examine all topologies. In this case, one may use the neighbor-joining method. While the
NJ tree is usually the same as the ME tree, when the number of taxa is small the difference between the NJ and ME
trees can be substantial (reviewed in Nei and Kumar 2000). In this case if a long DNA or amino acid sequence is
used, the ME tree is preferable. When the number of nucleotides or amino acids used is relatively small, the NJ
method generates the correct topology more often than does the ME method (Nei et al. 1998, Takahashi and Nei
2000). In MEGA, we have provided the close-neighbor-interchange search to examine the neighborhood of the NJ tree
to find the potential ME tree.
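
To give a sense of how quickly the number of topologies grows, the sketch below uses the standard counting formula
for unrooted bifurcating trees, 3 x 5 x ... x (2n - 5) for n taxa. The formula is not stated in this manual and is
included here only as background; the function name is illustrative.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating topologies for n_taxa >= 3 taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_topologies(n))
```

With 10 taxa there are already more than two million topologies, which is why heuristic searches are used.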

Analysis Options

Analysis Preferences (Minimum Evolution)


In this dialog box you can select and view desired options in the Options Summary. Options are organized in logical
sections. A yellow row indicates that you have a choice for that particular attribute. The primary sets of options
available in this dialog box are:
Analysis
Test of Phylogeny
To assess the reliability of a phylogenetic tree, MEGA provides two different types of tests: the Bootstrap
test and the Interior branch test. Both of these tests use the bootstrap re-sampling strategy, so you need to
enter the number of replicates. Only the tests that are applicable to the given data set and phylogeny inference
method are enabled.
Substitution Model
In this set of options, you can choose various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection
in this row affects the available models in the model row.
Model
Here you select a stochastic model for estimating evolutionary distance by clicking on the row then selecting a
model for the current Substitutions Type.
Substitutions to Include
Depending on the distance model or method selected, the evolutionary distance can be broken down into two or more
components. By clicking on the row, you will be provided with a list of components relevant to the chosen
model.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Pattern among Lineages
This option becomes available if the selected model has formulas that allow the relaxation of the assumption
of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the selected distance model has formulas that allow rate variation among
sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.
Include Sites
These are options for handling gaps and missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions and you have selected a
nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.

Tree Inference Options


ME Heuristic Method
MEGA employs the Close-Neighbor-Interchange (CNI) algorithm for finding the ME tree. It is
a branch swapping method, which begins with an initial NJ tree.
Initial Tree For ME
This is obtained by using Neighbor Joining.
ME Search Level
Select a search level 1 or 2.

HEURISTIC SEARCH

Close-Neighbor-Interchange (CNI)
In any method, examining all possible topologies is very time-consuming. This algorithm reduces the time spent
searching by first producing a temporary tree (e.g., an NJ tree when an ME tree is being sought), and then examining
all of the topologies that differ from this temporary tree by a topological distance of dT = 2 and 4. If this is
repeated many times, and all the topologies previously examined are avoided, one can usually obtain the tree being
sought.
For the MP method, the CNI search can start with a tree generated by the random addition of sequences. This process
can be repeated multiple times to find the MP tree.
See Nei & Kumar (2000) for details.

MAXIMUM PARSIMONY (MP) METHOD

Branch-and-Bound algorithm
The branch-and-bound algorithm is used to find all the MP trees. It is guaranteed to find all the MP trees without
conducting an exhaustive search. MEGA also employs the Max-mini branch-and-bound search, which is described in
detail in Kumar et al. (1993) and Nei and Kumar (2000, page 123).

Alignment Gaps and Sites with Missing Information


In MEGA, gaps and missing data are never used in computing tree lengths in the MP analysis, but there are three
different ways to treat the sites that contain them. One is to delete all of these sites from data analysis. This
option, called the Complete-Deletion option, is generally desirable because different regions of DNA or amino acid
sequences often evolve under different evolutionary forces. However, if the number of nucleotides (or amino acids)
involved in a gap is small and gaps are distributed more or less randomly, you may include all such sites and treat
the gaps as missing data. The final option is Partial Deletion, which removes a site only if the percentage of
unambiguous (non-gap, non-missing) characters at that site falls below a specified site coverage cutoff.
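
The deletion options can be pictured as a filter on alignment columns. The sketch below is a minimal illustration,
assuming that gaps are written as "-" and missing data as "?", and that a site is kept when its fraction of
unambiguous characters is at least the chosen coverage. The function name, the symbols, and the exact comparison
rule are assumptions for this example, and MEGA's own rule may differ in detail.

```python
MISSING = {"-", "?"}  # assumed gap and missing-data symbols


def filter_sites(alignment, site_coverage=1.0):
    """Keep columns whose fraction of unambiguous characters >= site_coverage.

    site_coverage=1.0 corresponds to complete deletion; a lower value
    corresponds to partial deletion with that coverage cutoff.
    """
    if not alignment:
        return []
    n_seq = len(alignment)
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")
    keep = [
        i for i in range(n_sites)
        if sum(seq[i] not in MISSING for seq in alignment) / n_seq >= site_coverage
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


aln = ["ACGT", "A-GT", "AC?T", "ACG-"]
print(filter_sites(aln, 1.0))   # complete deletion
print(filter_sites(aln, 0.75))  # partial deletion, 75% coverage
```

In this toy alignment only the first column survives complete deletion, while a 75% cutoff keeps all four columns.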

Consensus Tree

The MP method often produces many equally parsimonious trees. Choosing this command produces a composite tree that
is a consensus among all such trees. This can be either a strict consensus, in which all conflicting branching
patterns among the trees are resolved by making those nodes multifurcating, or a Majority-Rule consensus, in which
conflicting branching patterns are resolved by selecting the pattern seen in more than 50% of the trees.
(Details are given in Nei and Kumar [2000], page 130).
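
The two consensus rules can be sketched by counting how often each branching pattern (split) occurs among the
equally parsimonious trees. In this illustration each tree is represented simply as a set of splits, and each split
as the frozenset of taxa on the side that does not contain a fixed reference taxon. That representation and the
function name are assumptions for the example, not MEGA's internal data structures.

```python
from collections import Counter


def consensus_splits(trees, majority=False):
    """Return the splits retained in a strict or majority-rule consensus.

    trees: list of sets of splits (one set per tree).
    """
    counts = Counter(split for tree in trees for split in tree)
    n_trees = len(trees)
    if majority:
        # Keep patterns seen in more than 50% of the trees.
        return {s for s, c in counts.items() if c > n_trees / 2}
    # Strict consensus: keep only patterns present in every tree.
    return {s for s, c in counts.items() if c == n_trees}


# Three trees on taxa A-E; splits are written from the side without E.
t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AB"), frozenset("ABD")}
t3 = {frozenset("AC"), frozenset("ABC")}
print(consensus_splits([t1, t2, t3]))                 # strict
print(consensus_splits([t1, t2, t3], majority=True))  # majority rule
```

No split is shared by all three trees, so the strict consensus is fully multifurcating, whereas the majority-rule
consensus keeps the two splits that occur in two of the three trees.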

Analysis Options: Analysis Preferences (Maximum Parsimony)


This dialog box contains a set of analysis options for use in Maximum Parsimony analysis. Information from this
dialog is used in the requested analysis, so it is important that you examine the options selected before pressing OK to
proceed with an analysis.
Phylogeny Test and Options
To assess the reliability of the MP trees, MEGA provides the bootstrap test. You need to enter the number of
replicates for this test if it is selected.
Tree Inference Options
Use this to select among the Subtree-Pruning-Regrafting (SPR), Tree-Bisection-Reconnection (TBR), and
Min-Mini heuristic search methods and the Max-Mini Branch-and-Bound search.
For the SPR and TBR methods, you may automatically obtain a set of initial trees by using the random
addition option to produce the initial trees.
For all methods except for Max-Mini Branch-and-Bound, the MP search level can be set.
For the branch-and-bound search, an optimized Max-Mini Branch-and-Bound algorithm is used. While this
algorithm is guaranteed to find all the MP trees, a branch-and-bound search often is too time-consuming for
more than 15 sequences, although this number varies from data set to data set.
For all methods, you can set the maximum number of equally parsimonious trees to retain.
Substitution Model
Substitution Type
For protein-coding data, the analysis can be performed at the nucleotide or the amino acid level.
Genetic Code Table
If protein-coding data is to be analyzed at the amino-acid level, a genetic code for translating the data must be
selected.
Data Subset to Use
This provides options for handling gaps and missing data in the analysis, specifying inclusion and exclusion
of codon positions, and restricting the analysis to only some types of labeled sites (if applicable).
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing-information before the parsimony
analysis begins using the Complete-deletion option. Alternatively, you may choose to retain all such sites. In
this case, all missing-information and alignment gap sites are treated as missing data in the calculation of tree
length. Your last option is Partial Deletion (site coverage), in which you select a percentage and only
sites whose proportion of unambiguous data is above that percentage are included.
Select Codon Positions
You may select any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This option is
available only if the nucleotide sequences contain protein-coding regions. If they do, you can choose between
the analysis of nucleotide sequences or translated protein sequences. If you choose the latter, MEGA will
translate all protein-coding regions into amino acid sequences and conduct the protein sequence parsimony
analysis.
Labeled Sites
This option is available only if there are labels associated with some or all of the sites in the data. By clicking
on the ellipses, you will have the option of including sites with selected labels. If you choose to include
only labeled sites, then these sites will be the first extracted from the data and all other options mentioned
above will be enforced. Note that labels associated with all three positions in the codon must be included for a
full codon to be incorporated in the analysis.

HEURISTIC SEARCH

Min-mini algorithm
This is a heuristic search algorithm for finding the MP tree, and is somewhat similar to the branch-and-bound search
method. However, in this algorithm, many trees that are unlikely to have a small local tree length are eliminated from
the computation of their L values. Thus while the algorithm speeds up the search for the MP tree, as compared to the
branch-and-bound search, the final tree or trees may not be the true MP tree(s). The user can specify a search factor to
control the extensiveness of the search and MEGA adds the user specified search factor to the current local upper
bound. Of course, the larger the search factor, the slower the search, since many more trees will be examined.
(See also Nei & Kumar (2000), pages 122, 125)

Tree Bisection Reconnection (TBR)


Tree Bisection and Reconnection is a search heuristic which reduces the number of topologies searched, so that we
don’t have to perform an exhaustive search of all possible tree topologies.
TBR selects a subtree, and deletes the internal branch which connects it to the main tree. This subtree is now not
connected to the main tree, which leaves us with 2 trees. The subtree is then reconnected to the main tree in all
possible connections between the branches of the two trees. If any of the reconnected trees score better than the
currently best tree, the algorithm selects that tree as the new best tree and performs another round of bisection,
reconnection, and scoring.

Analysis Options:

Analysis Preferences (Maximum Likelihood)


Analysis Preferences (ML)
In this dialog box, you can view and select desired settings for options which are listed in the Options Summary tab.
Available options are organized into logical sections. A yellow row indicates an analysis option for which there are
multiple settings available. To view the available settings for an option, click on its row. The five primary categories
of options available for ML analysis are:
Phylogeny Test
To assess the reliability of a phylogenetic tree, MEGA provides the Bootstrap test. When selected, this test
requires you to enter the desired number of replicates.
Substitution Model
In this set of options, you can select various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
Here you may select a substitution type of either Nucleotide or Amino Acid, depending on the active sequence
data. The selection in this row determines the models of evolution that will be available in
the Model/Method row.
Genetic Code Table (displayed for protein coding nucleotide data only)
By accessing this option, you can select from a list of genetic code tables for automatically translating codons
into amino acids. Alternatively, you can open a Code Table Editor from within this option.
Model/Method
Here you can select a stochastic model for estimating evolutionary distance by clicking on the row then
selecting from one of the models that are applicable for the currently selected Substitutions Type.
Rates and Patterns
Rates among Sites
This option will be enabled if the selected model has formulas that allow rate variation among sites. If you
choose gamma-distributed rates, then the No of Discrete Gamma Categories option becomes visible
(see Gamma parameter).
Data Subset to Use
These are options for handling gaps and missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps/Missing Data Treatment
For this analysis setting, you may choose to remove all sites containing alignment gaps and missing
information before the calculation begins (Complete-deletion option). Alternatively, you may choose to retain
all such sites initially, excluding them as necessary in the pairwise distance estimation (Pairwise-deletion
option), or you may use Partial-Deletion (Site Coverage Cutoff) as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if you are analyzing nucleotide sequences which contain protein-coding regions and
you have selected a nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you will
be provided with the option of including sites with selected labels. If you choose to include only labeled sites,
then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
Tree Inference Options
ML Heuristic Method
This option allows you to specify either the Nearest-Neighbor-Interchange (NNI) or Subtree-Pruning-
Regrafting (SPR) tree searching strategy.

HEURISTIC SEARCH

Nearest-Neighbor-Interchange (NNI)
The Nearest-Neighbor-Interchange is a heuristic that tries to improve the likelihood of a tree by rearranging it
locally. Each interior branch of an unrooted bifurcating tree connects four subtrees, two on each side. An NNI move
swaps a subtree on one side of the branch with a subtree on the other side in an attempt to get a tree which has a
higher likelihood.
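
As a small illustration of a single NNI move: an interior branch of an unrooted bifurcating tree separates four
subtrees, say a and b on one side and c and d on the other, and exchanging one subtree across the branch gives
exactly two alternative arrangements. The function below only enumerates those two arrangements; the name and the
nested-tuple representation are assumptions for this sketch, and likelihood scoring is left out.

```python
def nni_neighbors(a, b, c, d):
    """Return the two NNI rearrangements around the branch (a, b) | (c, d).

    Each argument stands for a whole subtree (here just a label).
    """
    return [
        ((a, c), (b, d)),  # swap b with c
        ((a, d), (b, c)),  # swap b with d
    ]


for arrangement in nni_neighbors("A", "B", "C", "D"):
    print(arrangement)
```

A search would score each arrangement and keep it only if it improves on the current tree.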

Subtree-Pruning-Regrafting (SPR)
For any tree searching method, an exhaustive search, in which all possible topologies are considered, is infeasible for
even a small number of taxa. Subtree Pruning and Regrafting is a tree topology search heuristic which reduces the number of
topologies searched by performing the following operations on the tree.
First, a subtree of the current best tree is selected and detached (pruned). Second, the detached subtree is regrafted
onto another branch of the remaining tree, in such a way that a new topology is created, and then the likelihood of the
new topology is calculated. This procedure is repeated for all regrafting positions that produce new topologies using
the pruned subtree. The procedure is also repeated for each subtree (within the designated search level) and if
the topology with best likelihood among those scored gives sufficient improvement over the current best tree,
that topology becomes the current best tree. This is repeated until no significant further likelihood improvements are
obtained.
A single pass of the SPR algorithm examines O(N^2) new trees, where N is the number of leaves in the original tree.
This is because, for each subtree there are O(N) possible regraftings, and there are O(N) possible subtrees to consider.
In contrast, NNI examines O(N) topologies at each pass of the algorithm.

STATISTICAL TESTS OF A TREE OBTAINED

General Comments on Statistical Tests


There are three different types of methods for testing the reliability of an obtained tree. One is to test the topological
difference between the tree and its closely related tree by using a certain quantity, for example, the sum of
all branch lengths in the minimum evolution method. This type of test examines the reliability of every
interior branch of the tree, and is generally a conservative test as compared to other tests included in MEGA.
The second type of test examines whether the length of each interior branch is significantly different from
0. If a particular interior branch is not significantly different from 0, we cannot exclude the possibility of a
trifurcation of the associated branches or that the other types of bifurcating trees can be generated by changing the
splitting order of the three branches involved. Therefore, in MEGA we implement the bootstrap procedure for
estimating the standard error of the interior branch and test the deviation of the branch length from 0 (Dopazo 1994).
The third type of test is the bootstrap test, in which the reliability of a given branch pattern is ascertained by
examining the frequency of its occurrence in a large number of trees, each based on the resampled dataset.
Details of these procedures are given in Nei and Kumar (2000, chapter 9).

Condensed Trees
When several interior branches of a phylogenetic tree have low statistical support (PC or PB) values, it often is useful
to produce a multifurcating tree by assuming that all such interior branches have a branch length equal to 0. We call this
multifurcating tree a condensed tree. In MEGA, condensed trees can be produced for any level of PC or PB value. For
example, if there are several branches with PC or PB values of less than 50%, a condensed tree with the
50% PC or PB level will be a multifurcating tree in which the lengths of all those branches are reduced to 0.
Since branches of low significance are eliminated to form a condensed tree, this tree emphasizes the reliable portions
of branching patterns. However, this tree has one drawback. Since some branches are reduced to 0, it is difficult to
draw a tree with proper branch lengths for the remaining portion. Therefore we give our attention only to
the topology so the branch lengths of a condensed tree in MEGA are not proportional to the number of nucleotide or
amino acid substitutions.
Note that, although they may look similar, condensed trees are different from the consensus trees mentioned earlier.
A consensus tree is produced from many equally parsimonious trees, whereas a condensed tree is merely a simplified
version of a tree. A condensed tree can be produced for any type of tree (NJ, ME, UPGMA, MP, or maximum-
likelihood tree).
See also Nei and Kumar (2000) page 175.
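
The idea of a condensed tree can be sketched as collapsing every interior branch whose support value is below the
chosen cutoff, so that its descendants attach directly to the parent node. The nested (support, children)
representation and the function name are assumptions made for this illustration; branch lengths are ignored, in
keeping with the note above that only the topology is of interest.

```python
def condense(node, cutoff):
    """Collapse interior branches with support below cutoff.

    A leaf is a string; an interior node is (support, [children]),
    where support is the PC or PB value of the branch above the node
    (None for the root).
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        condensed = condense(child, cutoff)
        weak = (
            not isinstance(condensed, str)
            and condensed[0] is not None
            and condensed[0] < cutoff
        )
        if weak:
            # Length reduced to 0: the grandchildren join the parent directly.
            new_children.extend(condensed[1])
        else:
            new_children.append(condensed)
    return (support, new_children)


tree = (None, [(40, ["A", "B"]), (90, ["C", "D"]), "E"])
print(condense(tree, 50))
```

With a 50% cutoff the weakly supported (A, B) grouping is dissolved, while the (C, D) grouping with 90% support is
kept.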

Interior Branch Tests:

Interior Branch Test of Phylogeny

Phylogeny | Construct/Test Neighbor-Joining Tree


Or
Phylogeny | Construct/Test Minimum-Evolution Tree

A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its
standard error and is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence
probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that
branch is considered significantly positive. Select test of phylogeny for either of these trees in the Analysis
Preferences dialog.

See Nei and Kumar (2000) (chapter 9) for further details.

Testing Neighbor Joining Tree: Neighbor Joining (Construct Phylogeny)

Phylogeny | Construct/Test Neighbor-Joining Tree…


This command is used to construct (or test) a neighbor-joining (NJ) tree (Saitou & Nei 1987). The NJ method is a
simplified version of the minimum evolution (ME) method, which uses distance measures that correct for multiple hits
at the same sites, and chooses a topology showing the smallest value of the sum of all branches (S) as an estimate of the
correct tree. However, the construction of an ME tree is time-consuming because, in principle, the S values for all
topologies have to be evaluated and the number of possible topologies (unrooted trees) rapidly increases with the
number of taxa.
With the NJ method, the S value is not computed for all or many topologies. The examination of different topologies
is embedded in the algorithm, so that only one final tree is produced. This method does not require the assumption of a
constant rate of evolution so it produces an unrooted tree. However, for ease of inspection, MEGA displays NJ trees
in a manner similar to rooted trees. The algorithm of the NJ method is somewhat complicated and is explained in
detail in Nei and Kumar (2000).
For constructing the NJ tree, MEGA may request that you specify the distance estimation method, subset of sites to
include, and whether to conduct a test of the inferred tree through an Analysis Preferences dialog box.

BOOTSTRAP TESTS:

Bootstrap Test of Phylogeny

Phylogeny | Construct/Test Neighbor-Joining Tree


Or
Phylogeny | Construct/Test Minimum-Evolution Tree
Or
Phylogeny | Construct/Test UPGMA Tree
Or
Phylogeny | Construct/Test Maximum Likelihood Tree
Or
Phylogeny | Construct/Test Maximum Parsimony Tree(s)

One of the most commonly used tests of the reliability of an inferred tree is Felsenstein's (1985) bootstrap test, which
is evaluated using Efron's (1982) bootstrap resampling technique. If there are m sequences, each with n nucleotides
(or codons or amino acids), a phylogenetic tree can be reconstructed using some tree building method. From the n
sites, n are randomly chosen with replacement, and the same sampled sites are taken from every sequence, giving rise to
m rows of n columns each. These now constitute a new set of sequences. A tree is then reconstructed with these new
sequences using the same tree building method as before. Next the topology of this tree is compared to that of the
original tree. Each interior branch of the original tree that partitions the sequences differently from the bootstrap
tree is given a score of 0; all other interior branches are given the value 1. This procedure of resampling the sites
and the subsequent tree reconstruction is repeated several hundred times, and the percentage of times each interior
branch is given a value of 1 is noted. This is known as the bootstrap value. As a general rule, if the bootstrap value
for a given interior branch is 95% or higher, then the topology at that branch is considered "correct". See Nei and
Kumar (2000) (chapter 9) for further details.
This test is available for five different methods: Neighbor Joining, Minimum Evolution, Maximum
Parsimony, UPGMA, and Maximum Likelihood.
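
The resampling and scoring steps of the bootstrap test can be sketched as follows. The tree building step itself is
not shown: the example assumes that some tree building method has already turned the original data and each
resampled data set into a set of splits (partitions of the sequences), with every split written as a frozenset of
taxon names. All function names here are hypothetical and the sketch is not MEGA's implementation.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the n columns of an alignment n times with replacement.

    The same sampled columns are used for every sequence.
    """
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


def bootstrap_support(original_splits, replicate_splits):
    """Percentage of replicate trees that contain each original split."""
    n_reps = len(replicate_splits)
    return {
        split: 100.0 * sum(split in rep for rep in replicate_splits) / n_reps
        for split in original_splits
    }


rng = random.Random(1)  # fixed seed so the run is repeatable
print(bootstrap_alignment(["ACGT", "AGGT", "TCGA"], rng))

original = {frozenset("AB")}
replicates = [{frozenset("AB")}, {frozenset("AC")}, {frozenset("AB")}, {frozenset("AB")}]
print(bootstrap_support(original, replicates))
```

In the toy example the (A, B) grouping is recovered in three of four replicate trees, giving a bootstrap value of
75%.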

Bootstrap method to compute standard error of distance estimates

When you choose the bootstrap method for estimating the standard error, you must specify the number of replicates
and the seed for the pseudorandom number generator. In each bootstrap replicate, the desired quantity is estimated,
and the standard deviation of these replicate estimates is then computed (see Nei and Kumar [2000], page 25 for details).
It is possible that in some bootstrap replicates the quantity you desire is not calculable for statistical or technical
reasons. In these cases, MEGA will discard the results of those bootstrap replicates and its final estimate will be
based on the results of all valid replicates. This means that the number of bootstrap replicates used can be smaller
than the number specified by the user. However, if the number of valid bootstrap replicates is < 25, then MEGA will
report that the standard error cannot be computed (an “n/c” will appear in the result window).
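
The handling of invalid replicates can be sketched as below. The example assumes that a replicate in which the
quantity could not be calculated is recorded as None, and it uses the sample standard deviation from Python's
statistics module. Both choices, and the function name, are assumptions for illustration; the manual does not state
which form of the standard deviation MEGA uses.

```python
import statistics

MIN_VALID_REPLICATES = 25  # threshold quoted in the text above


def bootstrap_standard_error(replicate_values):
    """Standard deviation of the valid bootstrap estimates, or "n/c"."""
    valid = [v for v in replicate_values if v is not None]
    if len(valid) < MIN_VALID_REPLICATES:
        return "n/c"  # too few valid replicates to report a standard error
    return statistics.stdev(valid)


print(bootstrap_standard_error([0.1] * 10 + [None] * 90))
print(bootstrap_standard_error([1.0, 3.0] * 15 + [None] * 5))
```

The first call has only 10 valid replicates and therefore yields "n/c"; the second uses the 30 valid values and
ignores the five invalid ones.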

MOLECULAR CLOCK TEST

Tajima's Test (Relative Rate)

Molecular Clocks | Tajima’s Relative Rate Test


Use this to conduct Tajima’s relative rate test (Tajima 1993), which works in the following way. Consider three
sequences, 1, 2 and 3, and let 3 be the out-group. Let n_ijk be the observed number of sites in which sequences 1, 2 and
3 have nucleotides i, j and k. Under the molecular clock hypothesis, E(n_ijk) = E(n_jik) irrespective of the substitution
model and whether or not the substitution rate varies with the site. If this equality is rejected, then the molecular
clock hypothesis can be rejected for this set of sequences.

In response to this command, you can select the three sequences for conducting Tajima’s test. For nucleotide
sequences, this test offers the flexibility of using only transitions, only transversions, or both. If the data is protein
coding, then you can choose to analyze translated sequences or any combination of codon positions by clicking on the
‘Data for Analysis’ button.

See Nei and Kumar (2000) (pages 193-196) for further description and an example.
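
A minimal sketch of the test is given below. It uses what is commonly described as the simplest form of Tajima's
statistic: m1 is the number of sites where only sequence 1 differs from the other two, m2 is the number where only
sequence 2 differs, and (m1 - m2)^2 / (m1 + m2) is compared with a chi-square distribution with one degree of
freedom. This formula is not given in this manual and is stated here as background; the function name and the
treatment of "-" and "?" as characters to skip are likewise assumptions.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Return (m1, m2, chi_square, p_value) for three aligned sequences."""
    m1 = m2 = 0
    for a, b, c in zip(seq1, seq2, outgroup):
        if "-" in (a, b, c) or "?" in (a, b, c):
            continue  # skip sites with gaps or missing data
        if a != b and b == c:
            m1 += 1  # change unique to sequence 1
        elif a != b and a == c:
            m2 += 1  # change unique to sequence 2
    if m1 + m2 == 0:
        raise ValueError("no informative sites for the test")
    chi_square = (m1 - m2) ** 2 / (m1 + m2)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return m1, m2, chi_square, p_value


print(tajima_relative_rate("GGGAAAAA", "AAAGAAAA", "AAAAAAAA"))
```

In this toy example m1 = 3 and m2 = 1, so the statistic is 1.0 and the clock hypothesis would not be rejected.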

Molecular Clock Test (ML)

Clocks | Test Molecular Clock(ML)

This option performs a Maximum Likelihood test of the molecular clock hypothesis for a given tree topology and
sequence alignment. (The “Molecular Clock Hypothesis” means that all tips of the tree are equidistant from the root
of the tree.) Two log-likelihood values are calculated and displayed, one with and one without the clock
hypothesis. The latter will always be larger (note that the numbers are negative, so “larger” means “smaller in
absolute value”). The statistical significance of the difference may be tested by comparing twice the difference in log-
likelihood values to a chi-squared threshold value with s-2 degrees of freedom, where s is the number of sequences in
the alignment.
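
The arithmetic of this comparison can be sketched as follows. The log-likelihood values in the example are made up
for illustration, and the function name is hypothetical; the resulting statistic would be compared with a
chi-square critical value looked up for the computed degrees of freedom.

```python
def clock_lrt(lnl_clock, lnl_no_clock, n_sequences):
    """Return (test statistic, degrees of freedom) for the clock test."""
    if n_sequences < 3:
        raise ValueError("need at least 3 sequences")
    statistic = 2.0 * (lnl_no_clock - lnl_clock)  # twice the difference
    degrees_of_freedom = n_sequences - 2           # s - 2
    return statistic, degrees_of_freedom


statistic, df = clock_lrt(lnl_clock=-1234.5, lnl_no_clock=-1230.0, n_sequences=5)
print(statistic, df)
```

Here the statistic is 9.0 with 3 degrees of freedom, which exceeds the 5% chi-square critical value of 7.815, so the
clock hypothesis would be rejected at that level for these hypothetical values.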

ANCESTRAL STATES

Inferring Ancestral Sequences (ML)

Ancestral Sequences | Infer Ancestral Sequences (ML)


This option uses the Maximum Likelihood method to estimate the ancestral state of each node in a phylogenetic
tree. The state is chosen to be the one that maximizes the probability of the given sequence data under the selected
model of nucleotide or amino acid evolution. Inferring ancestral sequences using ML, on average, gives more
accurate results than using Maximum Parsimony, especially when the phylogenetic tree includes long branches.

Inferring Ancestral Sequences (MP)

Ancestral Sequences | Infer Ancestral Sequences (Parsimony)


When the sequence diversity is low, Maximum Parsimony is effective at inferring the ancestral sequences. If
your sequences are somewhat distant, MP may produce several possible sequences, and finding the most probable
one can sometimes be difficult.

CONSTRUCTING TIME TREES

Time Trees
Time Trees can be computed in MEGA where divergence times are estimated for all branching points in a tree using
the RelTime method (RelTime is described in Tamura et al. 2012) which does not require assumptions for lineage
rate variations. The implementation in MEGA is very fast and expands on the RelTime method so that multiple
calibration constraints can be provided, in which case MEGA will produce absolute divergence times along with
relative divergence times while respecting the provided constraints. Additionally, the implementation in MEGA can
compute divergence times without calibration constraints, in which case, only relative times will be produced.
There are several types of calibrations that can be used in MEGA:
Calibration densities:
Statistical densities that provide prior belief about the possible location of the true species divergence time
relative to the minimum and/or maximum constraints can be used. When using this option, each calibration
density is transformed into a pair of discrete constraints such that the minimum bound is placed at the 2.5%
quantile of the density and the maximum bound at the 97.5% quantile. This means that the minimum and
maximum bounds will cover 95% of the total probability density. Four statistical distributions can be used for
calibration densities in MEGA:
Normal - requires that a mean and standard deviation be provided and minimum and maximum constraints will
be derived from the distribution. For instance, a calibration density using a normal distribution with mean=10
and stddev=1 will produce a constraint where minTime=8.04 and maxTime=11.96
Exponential - requires that a divergence time and decay are provided and a minimum constraint will be derived
from the distribution. For instance, a calibration density using an exponential distribution with mean=10 and
decay=0.25 will produce a constraint where minTime=9.9
Uniform - requires that a minTime and maxTime be provided and will produce a constraint whose minTime
and maxTime are those provided.
Lognormal - requires that 3 parameters are provided: offset, mean, and stddev and minimum and maximum
constraints will be derived from the distribution. For instance, a calibration density using a lognormal
distribution with offset=7, mean=1.5, and stddev=0.15 will produce a constraint where minTime=10.34 and
maxTime=13.01
Minimum Times:
Sets a hard minimum divergence time constraint on the target node.
Maximum Times:
Sets a hard maximum divergence time constraint on the target node.
Fixed Times:
The divergence time for the target node will be equal to the provided fixed time.
Fixed Rate:
This option will define a global evolutionary rate r (in units of substitutions per site per year) that is used
throughout the tree. For every node in the tree whose height (in units of substitutions per site) is h, the
divergence time of the node will be set to h/r.
Tip Dates (sample times):
This option is only used for the RTDT (RelTime with Dated Tips) method. In this case, the tip dates are the
dates at which molecular sequences were sampled. This method is suitable for the analysis of DNA and protein
sequences from fast-evolving pathogens and sequences obtained from ancient samples.
See also
Time Tree Tutorial
Molecular Clock Test (ML)
Calibration Times Editor

Calibration Dialog
The Calibration Editor allows you to define multiple divergence time calibration constraints which will be used for
the RelTime analyses.
When you select a RelTime analysis, the Timetree Wizard will be displayed and you will first be prompted to provide
an alignment file (if one is not already activated) and a tree file which gives the topology for the time tree, and then
specify an outgroup in the tree to place the root on. Next, you will have the option to specify divergence time
calibration constraints. If you select this option, the Calibration Editor will be displayed and it can be used to specify
divergence time calibration constraints.
Specifying Constraints
To specify a calibration constraint using the Calibration Editor, select a constraint type (see overview) from
the Calibration | Calibrate MRCA menu. This will create a new calibration constraint and a dialog asking for
constraint parameters will be shown. After entering the constraint parameters, you specify a node for which the new
calibration applies by selecting two taxa from the Taxon A and Taxon B dropdown lists whose most recent common
ancestor (MRCA) is the node to apply the constraint to. Next, edit the calibration name if the default name is not
satisfactory. You may also edit the node label name (optional) in the MRCA Node Label edit box. This node label is
useful for interpreting the tabular Time Tree output produced by MEGA’s Time Tree system so that you can quickly
identify calibrated nodes by name instead of node number. You can edit the selected constraint's values at any time by
clicking the edit button that is next to the constraint value(s). You can provide min and max times, just a min time, or
just a max time (as long as at least one min time and one max time are provided among all constraints).

To specify a calibration constraint using the tree display, select an internal node in the tree and then select a constraint
type from the Calibration | Calibrate Selected Node menu. When you launch this action, a new calibration constraint
will be created in the Calibration Editor with two taxa already selected from the Taxon A and Taxon B dropdown
lists. Then you can finish providing calibration parameters as described above.
Once you are finished specifying constraints, you can save your changes by clicking the OK button. This will advance
the Timetree Wizard to the next step.

Importing and Exporting


Calibration constraints can be imported from a text file or saved to a text file using the Import and Export features.
Calibration constraints imported from text files must be formatted according to a set of rules which are described here.
When you export a set of calibration constraints, they are saved in a format that can be imported into the Calibration
Editor and also be used by the command-line version of MEGA (MEGA-CC). Also, once a time tree analysis is
complete,

Time Tree Tool

The Timetree tool in the Tree Explorer is used for calculating relative and absolute divergence times for all
branching points in the tree. Using the Timetree tool will produce a time tree with the same topology as the
active tree, where MEGA estimates local clock rates and divergence times for all branching points in the tree
using the RelTime (see Tamura et al. 2012) method. When using this tool, all divergence time estimates are
based solely on the branch lengths in the active tree (MEGA provides options to pre-compute branch lengths
(e.g. using the likelihood-based tool) from the Clocks menu on the main MEGA form).
To use the Timetree Tool in Tree Explorer, select Compute | Compute Time Tree (or click the Time Tree
Tool button which looks like a clock). The Timetree Wizard, which specifies the steps for creating a
timetree, will then be displayed.
Once the Time Tree tool is finished, estimated divergence times and local clock rates can be exported to a
text file (File | Export Current Tree (Time Tree)) or viewed in the information window (File | Show
Information).
See also
Time Trees
Time Tree (ML) tutorial
Molecular Clock Test

Calibration File Format

The calibration file is used to provide divergence time calibration constraints to MEGA so that MEGA
can convert relative divergence time estimates into absolute divergence times while respecting the
given constraints.
There are three valid formats for providing calibration values in this file:

!NodeName='some name' minTime=1.75 maxTime=2.25


With this format, the NodeName value must match an internal node label in the Newick
file being evaluated.

!MRCA='some name' TaxonA='taxon A name' TaxonB='taxon B name' minTime=1.75 maxTime=2.25
The value for MRCA is an internal node label that will be assigned to the target node. If a
label for that target node is also supplied in the input Newick file, the label in the Newick file
will be ignored. The values for TaxonA and TaxonB specify two leaf nodes whose most recent
common ancestor in the active phylogeny is the calibration target node.
!MRCA='demoLabel1' TaxonA=chimpanzee TaxonB=bonobo Distribution=normal mean=6.4 stddev=1.2
!NodeName='demoLabel2' Distribution=exponential time=8.2 decay=0.25
!MRCA='orangutan-sumatran' TaxonA=orangutan TaxonB=sumatran Distribution=uniform mintime=4 maxtime=6
!MRCA='orangutan-sumatran' TaxonA=orangutan TaxonB=sumatran Distribution=lognormal offset=7.0 mean=2.38 stddev=0.15
The four examples above specify statistical distributions to be used as calibration densities (can
be one of normal, lognormal, exponential, or uniform). When using this format, each calibration
density will be transformed into a pair of discrete constraints such that the minimum bound is placed at
2.5% of the density age and the maximum bound at 97.5% of the density age. For instance, a
normal distribution with mean equal to 10 and stddev equal to 1 would result in a constraint with
minTime=8.04 and maxTime=11.96.

Note*** When specifying an exponential distribution, one can use the keywords offset and lambda in
place of time and decay respectively.

A single fixed time may be provided and for the RTDT analysis, this format should be used. For
example:
!NodeName='some name' time=2007

Optionally, a calibration can be given a name as the last parameter (e.g.
!MRCA…calibrationName='myCalib'). Multiple calibrations may be provided, in which case, MEGA
will generate absolute divergence times for all branching points in the active phylogeny while
respecting all of the provided constraints. Each calibration must be on a single line and only one
calibration can be provided per line. Taxa names must match those in the input Newick file. If a taxon
name contains whitespace, the name must be placed in single quotes. Single quotes are not allowed
inside taxa names.
The following are examples of valid calibration constraints:

!MRCA='ch-bo' TaxonA='chimpanzee' TaxonB='bonobo' MinTime=0.8 MaxTime=5.0;

!NodeName='myNode' time=6.3 calibrationName='myCalib';

!NodeName='gorilla_human' MinTime=3.7 calibrationName='gorilla human';

!MRCA='orangutan sumatran' TaxonA='orangutan' TaxonB='sumatran' MaxTime=11.0;


!MRCA='demoLabel1' TaxonA=chimpanzee TaxonB=bonobo Distribution=normal mean=6.4 stddev=1.2
!NodeName='demoLabel2' Distribution=exponential time=8.2 decay=0.25;
!MRCA='orangutan-sumatran' TaxonA=orangutan TaxonB=sumatran Distribution=uniform mintime=4 maxtime=6;

!MRCA='orangutan-sumatran' TaxonA=orangutan TaxonB=sumatran Distribution=lognormal offset=7.0 mean=2.38 stddev=0.15
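When preparing calibration files by script, it can help to check each line before passing the file to MEGA. The following is a minimal sketch of a line reader, not MEGA's own parser: the function name parse_calibration is hypothetical, the treatment of keys as case-insensitive is an assumption based on the mixed spellings (minTime, MinTime, mintime) in the examples above, and a trailing semicolon is simply ignored.

```python
import re

# A key=value pair whose value is either single-quoted or a run of
# characters containing no whitespace and no semicolon.
PAIR = re.compile(r"(\w+)=(?:'([^']*)'|([^\s;]+))")


def parse_calibration(line):
    """Split one calibration line into a dict of lower-cased keys to string values."""
    line = line.strip()
    if not line.startswith("!"):
        raise ValueError("a calibration line is expected to start with '!'")
    fields = {}
    for key, quoted, bare in PAIR.findall(line[1:]):
        # Exactly one of the two value groups is non-empty for a given match.
        fields[key.lower()] = quoted if quoted else bare
    return fields


example = "!MRCA='ch-bo' TaxonA='chimpanzee' TaxonB='bonobo' MinTime=0.8 MaxTime=5.0;"
print(parse_calibration(example))
```

Values are returned as strings so that the caller can decide how to convert times and distribution parameters.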

Calibrate Timetree with Molecular Clock


MEGA provides two options for calibrating the molecular clock for RelTime trees (where no calibration constraints
have been used):

Use a single fixed time of divergence - in this case, the clock calibration is simply the (relative) height of the target
node divided by the fixed time of divergence, and then this calibrated clock rate sets the scale to convert all relative
times in the tree into absolute times.

Use a fixed evolutionary rate - using this option one can define a global evolutionary rate r (in units of substitutions
per site per year) that is used throughout the tree. For every node in the tree whose relative height (in units of
substitutions per site) is h, the divergence time of the node will be set to h/r.
Both of these options are only available from the Tree Explorer window when displaying a RelTime tree that has been
generated without using calibration constraints. To access this utility in the Tree Explorer window, select an internal
node or a branch to focus on in the tree and click Compute | Calibrate Molecular Clock. The molecular clock dialog
will then be shown and if a branch in the tree is focused, the option for using a fixed evolutionary rate will be enabled.
If a node in the tree is focused, the option to use a fixed time of divergence will be enabled (in which case you can
also set the scale bar title using the Time Unit edit box in this dialog).
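The arithmetic behind the two options is small enough to sketch directly. The snippet below is an illustration only: the function names, node names, and numeric values are hypothetical, and it assumes node heights are already available as a mapping from node name to relative height.

```python
def times_from_fixed_rate(heights, rate):
    """Convert each node height h into a divergence time h / r."""
    return {node: height / rate for node, height in heights.items()}


def times_from_fixed_time(heights, target, fixed_time):
    """Calibrate the clock on one node, then scale every other node with it."""
    rate = heights[target] / fixed_time  # calibrated clock rate
    return times_from_fixed_rate(heights, rate)


# Hypothetical relative heights (substitutions per site).
heights = {"root": 0.10, "ingroup": 0.05, "pair": 0.01}
times = times_from_fixed_time(heights, target="root", fixed_time=20.0)
print(times)
```

In this example the root is fixed at 20 time units, so a node at half the root's height is placed at about 10 time units.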

Autocorrelated Rates Test


When using Bayesian methods for estimating times of divergence in a phylogenetic tree, the selection of a branch rate
model is an important consideration. Both the independent branch rate (IBR) model and the autocorrelated branch rate
(ABR) model are widely used but can result in significantly different time estimates (Battistuzzi et al. 2010; Christin et al.
2014; Dos Reis et al. 2014, 2015; Foster et al. 2016; Liu et al. 2017; Pacheco et al. 2018; Takezaki 2018). In MEGA
we have now added the CorrTest method (Tao et al. 2019) for detecting autocorrelation of evolutionary rates in large
phylogenies. The CorrTest method, which uses a predictive model that was developed using machine learning
techniques, tests the null hypothesis of independence of evolutionary rates among lineages. The CorrTest method
requires a phylogeny with branch lengths as well as the specification of an outgroup, and the implementation in MEGA
adds the ability to calculate the required branch lengths using the maximum likelihood (ML) method. The CorrTest
implementation in the MEGA GUI uses the same wizard system that is used for the EP method as described here. The
final output of CorrTest is a P value for deciding whether or not the IBR model should be rejected and a correlation
score (CorrScore). The CorrScore is a value between 0 and 1 where a high value indicates that the branch rates are
autocorrelated. The CorrTest method is an accurate and computationally efficient alternative to the Bayes factor (BF)
approach to detect autocorrelation of evolutionary rates among lineages.

HANDLING MISSING DATA AND ALIGNMENT GAPS

Gaps In Distance Estimation


Gaps often are inserted during the alignment of homologous regions of sequences and represent deletions or
insertions (indels). They introduce some complications in distance estimation. Furthermore, sites with missing
information sometimes result from experimental difficulties; they present the same alignment problems as gaps. In
the following discussion, both of these situations are treated in the same way.
In MEGA, there are two ways to treat gaps. One is to delete all of these sites from the data analysis. This option,
called the Complete-Deletion, is generally desirable because different regions of DNA or amino acid sequences
evolve under different evolutionary forces. The second method is relevant if the number of nucleotides involved in
a gap is small and if the gaps are distributed more or less randomly. In that case it may be possible to compute a
distance for each pair of sequences, ignoring only those gaps that are involved in the comparison; this option is
called Pairwise-Deletion. The following table illustrates the effect of these options on distance estimation with the
following three sequences:
      1       10        20
seq1  A-AC-GGAT-AGGA-ATAAA
seq2  AT-CC?GATAA?GAAAAC-A
seq3  ATTCC-GA?TACGATA-AGA
Total sites = 20.
Here, the alignment gaps are indicated with a hyphen (-) and the missing information sites are denoted by a
question mark (?).
Complete-Deletion and Pairwise-Deletion options
                                             Differences/Comparisons
Option             Sequence Data             (1,2)   (1,3)   (2,3)
Complete-Deletion  1. A  C  GA  A GA A A A   1/10    0/10    1/10
                   2. A  C  GA  A GA A C A
                   3. A  C  GA  A GA A A A
Pairwise-Deletion  1. A-AC-GGAT-AGGA-ATAAA   2/12    3/13    3/14
                   2. AT-CC?GATAA?GAAAAC-A
                   3. ATTCC-GA?TACGATA-AGA
In the above table, the number of compared sites varies with pairwise comparisons in the Pairwise-Deletion option,
but remains the same for pairwise comparisons in the Complete-Deletion option. In this data set, more information
can be obtained by using the Pairwise-Deletion option. In practice, however, different regions of nucleotide or
amino acid sequences often evolve differently, in which case, the Complete-Deletion option is preferable.
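The counts in the table can be checked with a short script. This is a minimal sketch using the three example sequences; the function names are hypothetical and only the hyphen and question mark are treated as missing, which is an assumption about this example rather than a description of MEGA's internals.

```python
from itertools import combinations

MISSING = set("-?")

alignment = [
    "A-AC-GGAT-AGGA-ATAAA",  # seq1
    "AT-CC?GATAA?GAAAAC-A",  # seq2
    "ATTCC-GA?TACGATA-AGA",  # seq3
]


def pairwise_deletion(a, b):
    """Count (differences, comparisons) over sites present in both sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
    return sum(x != y for x, y in pairs), len(pairs)


def complete_deletion(seqs):
    """Drop every site that has a gap or missing symbol in any sequence."""
    keep = [i for i, column in enumerate(zip(*seqs)) if not MISSING & set(column)]
    return ["".join(seq[i] for i in keep) for seq in seqs]


reduced = complete_deletion(alignment)
for i, j in combinations(range(3), 2):
    print((i + 1, j + 1),
          "complete:", pairwise_deletion(reduced[i], reduced[j]),
          "pairwise:", pairwise_deletion(alignment[i], alignment[j]))
```

After complete deletion no gaps remain, so the same counting function can be reused on the reduced sequences.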

Alignment Gaps and Sites with Missing Information


In MEGA, gap sites are ignored in the MP analysis, but there are three different ways to treat these sites. One is to
delete all of these sites from data analysis. This option, called the Complete-Deletion option, is generally desirable
because different regions of DNA or amino acid sequences often evolve under different evolutionary forces. However,
if the number of nucleotides (or amino acids) involved in a gap is small and gaps are distributed more or less
randomly, you may include all such sites and treat them as missing data. In either case, gaps and missing data are never
used in computing tree lengths. The final option is Partial Deletion, which deletes a site when its percentage of
unambiguous data is below a specified cutoff.

Include Sites Option


With this command you can set the options for handling gaps and missing data in the analysis, such as including or
excluding codon positions, and restricting the analysis to only some types of labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the parsimony analysis
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites. In this case, all
missing-information and alignment gap sites are treated as missing data in the calculation of tree length. The third option
is Partial Deletion (Site coverage), given as a percentage of unambiguous data: a site is deleted if it has less
unambiguous data than the percentage specified.
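The site coverage rule can be illustrated with a small filter. This is a sketch under stated assumptions, not MEGA's code: the function name is hypothetical, only '-' and '?' count as ambiguous, and a site whose coverage equals the cutoff exactly is kept.

```python
MISSING = set("-?")


def partial_deletion(seqs, cutoff_percent):
    """Return 0-based indices of sites whose unambiguous coverage meets the cutoff."""
    kept = []
    for index, column in enumerate(zip(*seqs)):
        unambiguous = sum(symbol not in MISSING for symbol in column)
        coverage = 100.0 * unambiguous / len(column)
        if coverage >= cutoff_percent:  # assumption: equality keeps the site
            kept.append(index)
    return kept


# Hypothetical three-sequence alignment with gaps (-) and missing data (?).
example = [
    "A-AC-GGAT-AGGA-ATAAA",
    "AT-CC?GATAA?GAAAAC-A",
    "ATTCC-GA?TACGATA-AGA",
]
print(len(partial_deletion(example, 100)), len(partial_deletion(example, 60)))
```

A cutoff of 100 behaves like complete deletion, while lower cutoffs retain sites in which only some sequences have gaps or missing data.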
Codon Positions
Check or uncheck the boxes to select any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions. If they do, you can choose between
the analysis of nucleotide sequences or translated protein sequences. If the latter is chosen, MEGA will translate all
protein-coding regions into amino acid sequences and conduct the protein sequence parsimony analysis.
Labeled Sites
This option is available only if you have labels associated with some or all of the sites in the data. By clicking on the
row, you will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to be
incorporated in the analysis.

TEST OF SELECTION

Synonymous / Nonsynonymous Tests

Large Sample Tests of Selection


One way to test whether positive selection is operating on a gene is to compare the relative abundance of synonymous
and nonsynonymous substitutions that have occurred in the gene sequences. For a pair of sequences, this is done by
first estimating the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous
substitutions per nonsynonymous site (dN), and their variances: Var(dS) and Var(dN), respectively. With this
information, we can test the null hypothesis that H0: dN = dS using a Z-test:
Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
The level of significance at which the null hypothesis is rejected depends on the alternative hypothesis (HA).
H0: dN = dS
HA: (a) dN ≠ dS (test of neutrality).
(b) dN > dS (positive selection).
(c) dN < dS (purifying selection).

For alternative hypotheses (b) and (c), we use a one-tailed test and for (a) we use a two-tailed test. These three tests
can be conducted directly for pairs of sequences, overall sequences, or within groups of sequences. For testing for
selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the analytical formulas or the
bootstrap resampling method.
For data sets containing more than two sequences, you can compute the average number of synonymous substitutions
and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar to the one mentioned
above. The variance of the difference between these two quantities is estimated by the bootstrap method (Nei and
Kumar [2000], page 56).
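The Z statistic and the P values for the three alternative hypotheses can be sketched as follows. The function name and the input values are hypothetical, and the P values come from the standard normal distribution, as is usual for a large-sample Z-test; MEGA's own output may be presented differently.

```python
from math import sqrt
from statistics import NormalDist


def z_test_of_selection(dN, dS, var_dN, var_dS):
    """Return Z and P values for neutrality, positive, and purifying selection."""
    z = (dN - dS) / sqrt(var_dS + var_dN)
    lower_tail = NormalDist().cdf(z)
    return {
        "Z": z,
        "neutrality": 2 * min(lower_tail, 1 - lower_tail),  # HA: dN != dS, two-tailed
        "positive": 1 - lower_tail,                         # HA: dN > dS, one-tailed
        "purifying": lower_tail,                            # HA: dN < dS, one-tailed
    }


# Hypothetical estimates for one pair of sequences.
result = z_test_of_selection(dN=0.05, dS=0.02, var_dN=0.0001, var_dS=0.0001)
print(result)
```

With these made-up inputs Z is about 2.12, so the one-tailed test for positive selection would reject the null hypothesis at the 5% level.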

Analysis Preferences (Z-test of Selection)


In this dialog box, you can view and select options in the Options Summary. Options are organized in logical
sections. A yellow row indicates that you have a choice for that particular attribute. The three primary sets of options
available in this dialog box are:
Analysis
Analysis Scope
Use this option to specify whether to conduct an analysis for sequence pairs, an overall average, or within
sequence groups (if sequence groups are specified).
Test Hypothesis
One way to test whether positive selection is operating on a gene is to compare the relative abundance of
synonymous and nonsynonymous substitutions within the gene sequences. For a pair of sequences, this is
done by first estimating the number of synonymous substitutions per synonymous site (dS) and the number of
nonsynonymous substitutions per nonsynonymous site (dN), and their variances: Var(dS) and Var(dN),
respectively. With this information, we can test the null hypothesis that H0: dN = dS using a Z-test:
Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
The level of significance at which the null hypothesis is rejected depends on the alternative hypothesis (HA):
H0: dN = dS
HA: (a) dN ≠ dS (test of neutrality).
(b) dN > dS (positive selection).
(c) dN < dS (purifying selection).

For alternative hypotheses (b) and (c), we use a one-tailed test and for (a) we use a two-tailed test. These three
tests can be conducted directly for pairs of sequences, overall sequences, or within groups of sequences. For
testing for selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the
analytical formulas or the bootstrap resampling method.
For data sets containing more than two sequences, you can compute the average number of synonymous
substitutions and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar
to the one mentioned above. The variance of the difference between these two quantities can be estimated by
the bootstrap method (Nei and Kumar [2000], page 56).
Variance Estimation Method
Depending on the scope of the analysis (pairwise versus other), you may compute standard errors using
analytical formulas or the bootstrap method. Whenever standard errors are estimated by the bootstrap
method, you will be prompted for the number of bootstrap replicates and a random number seed.
When the selected test involves the computation of average distance, only the bootstrap method is available
for computing standard errors.
Substitution Model
In this set of options, you can choose various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
This is limited to Syn-Nonsynonymous.
Model
By clicking on the row of the currently selected model, you may select a stochastic model for estimating
evolutionary distance (click on the yellow row first). This will reveal a menu containing many different
distance methods and models.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Data Subset to Use
These are options for handling gaps and missing data and restricting the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Labeled Sites
This option is available only if there are labels associated with some or all of the sites in the data. By
clicking on the yellow row, you will have the option of including sites with selected labels. If you chose to
include only labeled sites, they will be first extracted from the data and all of the other options mentioned
above will be enforced. Note that labels associated with all three positions in the codon must be included for
a full codon in the analysis.

Analysis Preferences (Fisher's Exact Test)


When the numbers of codons or the total numbers of synonymous and/or nonsynonymous substitutions are small,
the large sample Z-test is too liberal in rejecting the null hypothesis. In these cases, Fisher’s Exact Test can be used
to examine the null hypothesis of neutral evolution. Only the Nei-Gojobori and Modified Nei-Gojobori methods
can be used for this test because it requires the direct computation of the numbers of synonymous and nonsynonymous
differences, and the numbers of synonymous and nonsynonymous sites. It should be used only when sequences show a
small number of differences. To conduct Fisher’s Exact Test, you need to specify two sets of options:
Substitution Model
In this set of options, you choose various attributes of the substitution models for DNA and protein sequences.
Substitutions Type
Here you are limited to Syn-Nonsyn.
Model
By clicking on the row of the currently selected model, you may select a stochastic model for estimating
evolutionary distance. This will reveal a menu containing two different options: the original or modified Nei
& Gojobori methods.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Data Subset to Use
These options handle gaps and missing data and restrict the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.

Analysis Preferences (Pattern Homogeneity Analysis)


In this dialog box, you can select and view options in the Options Summary for the different pattern homogeneity
analyses (Composition Distance, Disparity Index, and Disparity Index Test). Options are organized in logical sections
and a yellow row indicates that you have a choice for that particular attribute.
No. of Monte-Carlo Repetitions
If the Disparity Index Test is selected, MEGA will conduct the Monte-Carlo analysis, for which you need to
provide the number of replicates and a starting random seed.
Data Subset to Use
These are options for handling gaps and missing data, including or excluding codon positions, and restricting
the analysis to labeled sites (if applicable).
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Codon Positions
Check or uncheck the boxes to select any combination of 1st, 2nd, 3rd, and non-coding positions for
analysis. This option is available only if the nucleotide sequences contain protein-coding regions and you
have selected a nucleotide-by-nucleotide analysis. If they do, you also can choose between the analysis of
nucleotide sequences or translated protein sequences. If the latter is chosen, MEGA will translate all protein-
coding regions into amino acid sequences and conduct the protein sequence analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.

OTHER TESTS
Tajima's Test of Neutrality
Selection | Tajima’s Test of Neutrality
This conducts Tajima’s test of neutrality (Tajima 1989), which compares the number of segregating sites per site with
the nucleotide diversity. (A site is considered segregating if, in a comparison of m sequences, there are two or more
nucleotides at that site; nucleotide diversity is defined as the average number of nucleotide differences per site
between two sequences). If all the alleles are selectively neutral, then the product 4Nv (where N is the effective
population size and v is the mutation rate per site) can be estimated in two ways, and the difference between the two
estimates provides an indication of non-neutral evolution. Please see Nei and Kumar (2000) (pages 260-261) for further
description.
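The two estimates of 4Nv mentioned above can be computed from an alignment in a few lines. This sketch covers only the two estimators (one from the number of segregating sites per site, one from the nucleotide diversity) and not the full test statistic with its variance term; the function name and the toy sequences are hypothetical, and the sequences are assumed to contain no gaps or missing data.

```python
from itertools import combinations


def tajima_estimates(seqs):
    """Return two estimates of 4Nv: from segregating sites and from nucleotide diversity."""
    m, n = len(seqs), len(seqs[0])
    # A site is segregating if it shows two or more different nucleotides.
    segregating = sum(len(set(column)) > 1 for column in zip(*seqs))
    a1 = sum(1 / i for i in range(1, m))
    theta_from_sites = (segregating / n) / a1
    # Nucleotide diversity: average number of differences per site between two sequences.
    pairs = list(combinations(seqs, 2))
    total_differences = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    diversity = total_differences / (len(pairs) * n)
    return theta_from_sites, diversity


theta_from_sites, diversity = tajima_estimates(["AAAA", "AAAT", "AATT"])
print(theta_from_sites, diversity)
```

For this toy alignment the two estimates coincide, so their difference is zero; a large difference in either direction is what the test interprets as a sign of non-neutral evolution.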

Exploring the impact of nsSNVs

Introduction to myPEG

Computational diagnosis of amino acid variants in the human exome is the first step in assessing the disruptive impacts
of non-synonymous single nucleotide variants (nsSNVs) on human health and disease. MEGA-MD
(Molecular Evolutionary Genetics Analysis – Mutation Diagnosis) is a client-server application used to forecast the
deleteriousness of nsSNVs using multiple methods and explore them in the context of the variability permitted in the
long-term evolution of the affected positions.
MEGA-MD accesses a relational database (MD-DB) resident on our servers that contains pre-computed diagnoses, and
associated information, for all possible mutations at all amino acid positions in the human exome. We have included
three primary methods (PolyPhen-2, SIFT, and EvoD) of predicting the functional impact of amino acid variants. The
first two are the most popular methods and the third significantly improves the performance for nsSNVs found at
ultra-conserved and at fast-evolving positions (Kumar et al., 2012). The PolyPhen-2 and SIFT diagnoses were obtained
from dbNSFP. We have also included results from a multi-method consensus diagnosis, because they have been shown
to be more reliable. In this case, we use the evolutionarily-balanced (see Liu and Kumar 2013) versions of PolyPhen-2
and SIFT diagnosis.
In addition to retrieving pre-computed predictions for variants in the human exome, MEGA-MD provides a facility to
infer ancestral states for the position where a given amino acid mutation is found. Maximum parsimony and maximum
likelihood approaches are supported by this utility which uses the 46 species reference phylogeny along with the 46
species peptide alignment for the relevant gene (obtained from the UCSC resource).
MEGA-MD is developed using the MEGA (Molecular Evolutionary Genetics Analysis) software package.

MEGA-MDW Server

All EvoD, PolyPhen-2, and SIFT predictions are pre-computed and stored on the MEGA-MDW web server. For all
variants of interest, predictions of functional impact and related data are retrieved from the MEGA-MDW web server and
displayed by the MEGA-MD rich graphical user interface (GUI).
The MEGA-MDW server can also be accessed directly through its web interface, although it does not provide the same
rich functionality that is found in the MEGA-MD desktop client application. However, for large numbers (e.g. > 10,000)
of nsSNVs, the MEGA-MDW Server web interface may be more suitable than the MEGA-MD client application
(depending on your internet connection speed) as the retrieval of data for many nsSNVs may take some time. The
MEGA-MDW server can be accessed from any web browser at www.mypeg.info/evod .

MEGA-MD Windows

Mutation Explorer

The Mutation Explorer window displays predictions and data associated with the nsSNVs being explored and provides
functionality for text searching, sorting, importing, exporting, formatting, and gene search. This window displays two
main views, each located on a separate tab:
Gene Search Tab
Prediction Data Tab

The actions provided by the Mutation Explorer are divided into several categories and are accessed using the main menu
bar or the main tool bar:
File
· Import Query Data From File – load coordinate information from a text file
· Search for a Gene – access the gene search page
· Export Table to Excel File – save all prediction data to an MS Excel file
· Export Table to CSV File – save all prediction data to a Comma-Separated-Values text file
· Exit – Close the application

Edit
· Copy – copy selected values to the system clip-board
· Select All – select all values in the table
· Clear Table – clear all data from the table

Format
· Increase Precision – increase the precision of all numeric values in the table (and also in the Mutation Detail
View window)
· Decrease Precision - decrease the precision of all numeric values in the table (and also in the Mutation Detail
View window)
· Resize Columns to Best-fit – resizes all columns in the table to achieve the best fit and optimize the view.
Useful when hiding/showing columns and column widths change sub-optimally. ***note: if there are many
records in the table (more than several thousand), this operation may take a few moments or more, during which
time the window will be unresponsive.

Search
· Find… - text search for values in the table
· Find Next – find the next value matching the search query (search goes to the right and then down to the next
row)

Options
· Keep detail view on top – toggle this action on/off to control whether the Mutation Detail View window stays in
front of other MEGA-MD windows (on by default).
· Show Toolbar – toggle on/off the display of the toolbar (on by default)
· Toggle Auto Column Width – when off (default) a horizontal scroll bar is used to view columns that don’t fit in
the window. When on, the horizontal scroll bar is removed and all columns are squeezed into view.
Windows
· Detail View Form – show the Mutation Detail View window
· Search for a Gene – jump to the Gene Search tab in the Mutation Explorer window
· Sequence Data Explorer – show the Sequence Data Explorer window
Help
· Contents – Display this help document
· About – show the About MEGA-MD window

Mutation Detail View

The Mutation Detail View window displays all available information for the currently active record (selected in
the Mutation Explorer window). Additionally, this window provides access to the 46-species reference alignment for
the given gene as well as the ability to infer ancestral alleles using the Maximum Likelihood (ML) or Maximum
Parsimony (MP) methods.
When the Explore Alignment button is clicked, MEGA-MD will retrieve the 46-species reference alignment from the
MEGA-MDW server and display it in the Sequence Data Explorer, from where it can be exported or explored
further.
When the Explore Ancestors button is clicked, a choice between the ML and MP methods is presented. If the ML approach is
selected, the Analysis Preferences Dialog is displayed from which the analysis can be launched with custom settings (e.g.
substitution model, distribution of rates, etc…). If the MP approach is selected, the analysis is launched immediately as
no custom settings are available for this method. When the analysis is completed, the reference topology will be
displayed in the Tree Explorer along with inferred ancestral alleles for the amino acid site designated earlier.

Sequence Data Explorer Window

The Sequence Data Explorer is used to display the 46-species alignment for a given gene and provides a graphical
interface for specifying amino acid position and mutant allele for nsSNVs of interest. With an alignment activated, the
amino acid position is specified by selecting the site of interest (which will be highlighted). With the site of interest
selected, the mutant allele (or all alleles) can be specified from the Diagnose Variant drop down list. When an allele is
selected from the list, MEGA-MD will query the MEGA-MDW server and append the returned predictions and related
data to the Mutation Explorer Predictions tab.
The Sequence Data Explorer window also provides other functionality, such as alignment export and
composition-based exploration.

Gene Search Tab

The Gene Search tab facilitates searching for genes by keyword (based on gene product) or alternatively by RefSeq identifiers
(mRNA ID or Protein ID). Search results (limited to 1000) are displayed in a list view with cursory information and a link for
retrieving the 46-species reference protein sequence alignment from the EvoD server. When a sequence alignment is retrieved it is
displayed in the Sequence Data Explorer, which can be used to specify the amino acid site and mutant allele for an nsSNV of
interest.

Prediction Data Tab

The Predictions tab displays all prediction data retrieved from the MEGA-MDW server in a list view. Complete
information for the currently active record is displayed in the Mutation Detail View. Columns of data are banded
together into categories:
· Mutations – identifiers as well as mutant and reference alleles are given here. Note – mutant amino acids that
are appended with an asterisk (*) have multiple rows returned by the MEGA-MD server, each row indicating a
mutation at the nucleotide level (look to the Coordinate Info band to see nucleotide change).
· Predictions – consensus, EvoD, PolyPhen-2, and SIFT predictions are given here. Both the original and
balanced predictions are given for PolyPhen-2 and SIFT (balanced predictions are described in Liu and Kumar
2013).
· Impact – the impact scores for EvoD, PolyPhen-2, and SIFT predictions are provided along with the Grantham
distance and Blosum62 value.
· Evolutionary Features (hidden by default) – substitution rate, position time span, and mutation time span are
displayed (see below for a description of how to display this band).
· Coordinate Info (hidden by default) – additional coordinate information is shown here, including chromosome,
strand, nucleotide position, amino acid position, wild nucleotide, and mutant nucleotide (see below for a
description of how to display this band).

To toggle on/off the display of a given band, click on the indicator button which is located to the far left in the band
headers row. A popup menu will appear from which bands can be selected/deselected. Often, when changing the
display of bands, column widths will change in undesirable ways. To remedy this, you can execute the Best-fit
Columns action by clicking Format->Resize Columns to Best-fit or clicking the toolbar button. Alternatively, column
widths can be adjusted by dragging their header edges.
The toolbar and main menu provide access to several actions for importing/exporting data, formatting the view, sorting,
text search, and setting view options.

INPUT DATA

Overview

In order to retrieve predictions for a given nsSNV, MEGA-MD requires three pieces of information:
1. RefSeq protein id (e.g. NP_000082)
2. amino acid position (e.g. 43)
3. mutant allele (e.g. R)
There are two ways to provide this coordinate information to MEGA-MD:
· Upload a text file
· Use the interactive wizard (via Gene Search and Sequence Data Explorer)

Upload a text file with the coordinate information for all nsSNVs of interest

Create a text file with coordinate information for all nsSNVs to be explored following the format below:

NP_000758 99 E
NP_000761 264 M
NP_000762 144 C
NP_000762 335 W
NP_000773 374 T
NP_000838 71 L
NP_000886 131 H
NP_000887 271 T

Each line contains coordinate information for one nsSNV and each value is separated by white space (i.e. spaces or tabs).
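
The file format above can be checked before upload with a short script. The following is a minimal sketch only, not the validation that MEGA-MD itself performs: the function name parse_query_lines, the accession pattern (NP_ followed by digits, with an optional version suffix), and the restriction to the 20 standard amino acid letters are all assumptions made for illustration.

```python
import re

# Assumed pattern for a RefSeq protein accession such as NP_000082;
# an optional version suffix (e.g. ".2") is tolerated in this sketch.
REFSEQ_PROTEIN = re.compile(r"^NP_[0-9]+(\.[0-9]+)?$")
# Assumption: mutant alleles are limited to the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_query_lines(lines):
    """Return (records, errors) for whitespace-separated coordinate lines.

    records: list of (protein_id, position, allele) tuples.
    errors: list of (line_number, message) tuples for rejected lines.
    """
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            errors.append((number, "expected 3 fields"))
            continue
        protein_id, position, allele = fields
        if not REFSEQ_PROTEIN.match(protein_id):
            errors.append((number, "bad RefSeq protein id"))
        elif not position.isdecimal() or int(position) < 1:
            errors.append((number, "bad amino acid position"))
        elif allele.upper() not in AMINO_ACIDS:
            errors.append((number, "bad mutant allele"))
        else:
            records.append((protein_id, int(position), allele.upper()))
    return records, errors
```

Passing the lines of a query file (for example, open(path).read().splitlines()) returns the accepted records along with the line numbers of any entries that would need correcting.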

In the Mutation Explorer window, select File->Import Query Data From File (or click the upload data button) and
browse for the newly created text file. MEGA-MD will first validate the format of the coordinate information file and
then request prediction information for all specified nsSNVs from the MEGA-MDW web server. As data is retrieved, the
Mutation Explorer window is updated.

The MEGA-MD application has no limit on the number of entries that can be included in the coordinate information file.
However, depending on your internet connection speed and the current load on the MEGA-MDW server, retrieval of
many predictions may take some time (anything less than 5,000 should not be problematic). For situations where
MEGA-MD does not perform optimally due to high numbers of nsSNVs, the MEGA-MDW server can be used directly
(www.mypeg.info/evod). The same text file can be uploaded to the MEGA-MDW server, which will process the file and
send you an email for retrieving prediction data once the processing is complete.

Specify the coordinate information using the Sequence Data Explorer

If a 46-species sequence alignment has been retrieved (see Gene Search) for a given gene, the Sequence Data Explorer window
can be used to first navigate to the amino acid site of interest and then specify a mutant allele.

IDENTIFYING GENE DUPLICATIONS

Gene Duplication Inference

Gene Duplication Wizard


MEGA7 introduces a new wizard-style system for identifying gene duplications (and optionally, speciation events) in
a gene family tree. This program is available from the main MEGA form in the User Tree menu. When the analysis is
launched, the Gene Duplication Wizard, which guides the user through the steps for inferring gene duplications, is
shown.
Input Data
• Gene Tree File – the gene family tree in a Newick formatted file is required for the analysis.
• Species Tree File – an optional species tree file in Newick format. If a species tree is provided, then the
algorithm described in Zmasek and Eddy (2001) will be used to infer gene duplications and speciation events
for all internal nodes in the gene tree. If the species tree is not provided, then all internal nodes in the tree that
contain at least one common species in the two descendant clades will be marked as gene duplication events.

• Mapping of taxa names to species names – the species name for each taxon must be provided and a simple
grid-like dialog is provided for completing this task. With this dialog, users can either manually enter the
species name for each taxon or load the names from a text file that gives the mapping in the form
taxonName=speciesName
for each taxon and each mapping is on its own line.
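
A mapping file in the taxonName=speciesName form can be prepared or checked with a few lines of code. This is a minimal sketch under stated assumptions: the function name parse_species_map is hypothetical, and the treatment of blank lines and surrounding spaces is a guess rather than documented MEGA behavior.

```python
def parse_species_map(lines):
    """Parse 'taxonName=speciesName' lines into a dict (taxon -> species)."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the first '=' only, so species names may contain '='.
        taxon, separator, species = line.partition("=")
        taxon, species = taxon.strip(), species.strip()
        if not separator or not taxon or not species:
            raise ValueError(f"line {number}: expected taxonName=speciesName")
        mapping[taxon] = species
    return mapping
```

For example, the two lines HBA_human=Homo sapiens and HBA_mouse=Mus musculus yield a dictionary with two entries, one per taxon.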
Steps for Doing the Gene Duplication Analysis
1. Load the gene tree file – in the first step, the wizard is used to browse for and load the Newick
formatted gene tree file
2. Map species names to taxa names - in the second step, species names are mapped to taxa names using a grid-
like interface. Species names can be entered manually or imported from a text file that gives the name for each
taxon as
taxonName=speciesName
and each mapping is on a separate line.
3. Load an (optional) species tree – if a species tree is available, the wizard can be used to browse for and load
the Newick formatted species tree file.
4. Root the gene tree (optional) – if the root of the gene tree is known, the wizard can be used to specify the root.
If this option is chosen, the gene tree will be displayed in the Tree Explorer window and users can specify the
root by clicking on a branch or node to root the tree on. If the root is not known, the analysis will be
performed with all possible root placements and the placement(s) of the root that results in the minimum
number of gene duplications will be kept and all others discarded.
5. Root the species tree – the analysis requires that the species tree, if provided, is rooted. Rooting the species
tree is done in the same way as rooting the gene tree, via the tree explorer.
6. Launch the Analysis – the final step is to launch the analysis. A progress window is displayed while the
calculation is executed and once the analysis is complete, the gene family tree is displayed in the Tree
Explorer window with gene duplications marked by solid blue diamonds in the tree and if a species tree was
provided, speciation events are marked by open red diamonds in the tree.
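
The rule used when no species tree is provided (an internal node is marked as a duplication if its two descendant clades share at least one species) can be sketched as follows. This is an illustration only and does not reproduce MEGA's implementation or the Zmasek and Eddy (2001) algorithm; the nested-tuple tree representation and the function name mark_duplications are assumptions.

```python
def mark_duplications(tree, species_of):
    """Return (species_set, duplications) for a rooted binary gene tree.

    tree: a taxon name (leaf) or a 2-tuple of subtrees (internal node).
    species_of: dict mapping each taxon name to its species name.
    duplications: internal nodes whose two child clades share a species.
    """
    if isinstance(tree, str):
        return {species_of[tree]}, []
    left, right = tree
    left_species, left_dups = mark_duplications(left, species_of)
    right_species, right_dups = mark_duplications(right, species_of)
    duplications = left_dups + right_dups
    if left_species & right_species:  # at least one species in common
        duplications.append(tree)
    return left_species | right_species, duplications


# Hypothetical gene family: two human/mouse paralog pairs and a fish gene.
tree = ((("h1", "m1"), ("h2", "m2")), "f1")
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "f1": "fish"}
species, duplications = mark_duplications(tree, species_of)
```

In this example only the node joining the two human/mouse clades is marked, because both of its child clades contain human and mouse; the root is not marked, since fish appears on one side only.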

Evolutionary Probabilities
The Evolutionary Probabilities (EP) analysis in MEGA is used for predicting permissible and forbidden mutations
from an evolutionary perspective. This tool computes evolutionary probabilities (EPs) (Liu et al. 2016) of alleles in
DNA and protein sequences based on long-term substitution patterns contained in multiple sequence alignments. The
EP value of an allele gives an evolutionary expectation of observing an allele in a population. The implementation of
the EP calculation in MEGA differs from that described by Liu et al. (2016) in that divergence time estimates are not required
a priori but rather are estimated by MEGA using the RelTime method (Tamura et al. 2012). The MEGA GUI provides
a wizard-style system that walks the user through the steps required to set up the analysis.

1. The EP analysis in MEGA requires two input data files, and the wizard system prompts the user for them:
   1. The first input is a multiple sequence alignment where the first sequence in the alignment is the focal
      sequence for which EP values will be calculated.
   2. The second input is a Newick formatted file that gives the evolutionary relationships for the sequences
      contained in the input sequence alignment.
2. After loading the input files, MEGA prompts the user to specify an outgroup, which can be done in either of
   two ways:
   1. The tree can be displayed in the Tree Explorer so that the outgroup can be specified by clicking on a
      branch in the tree
   2. The list of taxa can be displayed in the Taxa/Groups dialog and the outgroup can be specified by
      selecting taxa names
3. The EP Wizard prompts for analysis options (substitution model, rates and patterns, data sub-setting, etc…) to
   be used by displaying the Analysis Preferences dialog.

Once set up is complete and the user launches the calculation, MEGA displays a progress dialog as EP values are
calculated for all sites included after sub-setting of the data. To compute EP values at a given site, MEGA computes a
set of posterior probabilities of observing a specific allele at that site in the focal species. The first value in this set is
computed using the full data set. The other values in the set are computed by progressively pruning the sister species
or group closest to the focal species. Pruning stops when the tree has only the focal species and the outgroup. During
this process, MEGA also computes relative times of divergence at each step and uses these divergence times to
compute the evolutionary time span (ETS, see Liu et al. 2016) at each step of the procedure. The ETS values are used
to formulate a weighted mean of the set of posterior probabilities, which gives the EP values at the current site. The
final result is the EP value for all possible bases (4 for DNA, 20 for amino acids) at each site in the input sequence
alignment and the result can be displayed in a spreadsheet or text format.
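
The final averaging step can be illustrated with a small calculation. This is a sketch under a loud assumption: the weights are taken to be directly proportional to the ETS value at each pruning step, which is one plausible reading of the description above and not a confirmed account of MEGA's formula (see Liu et al. 2016 for the original definition). The function name weighted_ep is hypothetical.

```python
def weighted_ep(posteriors, ets_values):
    """Weighted mean of per-step posterior probabilities for one allele.

    posteriors: posterior probability of the allele at each pruning step.
    ets_values: evolutionary time span at the same steps, used as weights
                (assumption: weight is proportional to ETS).
    """
    if not posteriors or len(posteriors) != len(ets_values):
        raise ValueError("need one ETS value per posterior probability")
    total = sum(ets_values)
    if total <= 0:
        raise ValueError("ETS values must sum to a positive number")
    weighted_sum = sum(p * w for p, w in zip(posteriors, ets_values))
    return weighted_sum / total
```

With two steps having posterior probabilities 0.9 and 0.5 and ETS weights 3.0 and 1.0, the result is (0.9 × 3.0 + 0.5 × 1.0) / 4.0 = 0.8.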

PART-V: VISUALIZING AND EXPLORING DATA

Distance Matrix Explorer

Distance Matrix Explorer

The Distance Matrix Explorer is used to display results from the pairwise distance calculations. It is an
intelligent viewer with the flexibility of altering display modes and functionalities and for computing within
groups, among groups, and overall averages.

This explorer consists of a number of regions as follows:


Menu Bar
File Menu
Display Menu
Average Menu
Help: This button brings up the help file.
Tool Bar
The tool bar provides quick access to a number of menu items.
General Utilities
Lower-left Triangle button: Click this icon to display pairwise distances in the lower-left matrix. If
standard errors (or other statistics) are shown, they will be displayed in the upper-right.
Upper-right Triangle button: Click this icon to display pairwise distances in the upper-right matrix. If
standard errors (or other statistics) also are shown, they will be displayed in the lower-left.
(A, B): This button is an on-off switch to write or hide the name of the highlighted taxa pair. The taxa pair
is displayed in the status bar below.
Distance Display Precision
[icon]: This decreases the precision of the distance display by one decimal place with each click of the
button.
[icon]: This increases the precision of the distance display by one decimal place with each click of the button.
Column Sizer: This has been replaced with a new system. Simply move your mouse over the divider
between the sequence name and the first column of data. Your cursor will change (to an arrow pointing left
and right). Click and drag to resize the names.
Export Data
[icon]: This brings up the Exporting Sequence Data dialog box, which contains options to control how
MEGA writes the output data; available options are Text, MEGA, CSV, and Excel.
The 2-Dimensional Data Grid
This grid displays the pairwise distances between taxa (or within groups etc.) in the form of a lower or upper
triangular matrix. The taxa names are the row-headers; the column headers are numbered from 1 to m,
with m being the number of taxa. There is a column sizer for the row-headers, so that you can increase or
decrease the column size to accommodate the full name of the sequences or groups.
Fixed Row: This is the first row in the data grid and displays the column number.
Fixed Column: This is the first and leftmost column in the data grid. This column is always visible even if
you scroll past the initial screen. It contains taxa names and an associated check box. To include or
exclude taxa from analysis, you can check or uncheck this box. In this column, you can drag-and-drop
taxa names to sort them.
Rest of the Grid: Cells to the right of the first column and below the first row contain the pairwise
distances (or other statistics). Note that all cells are drawn in light color if they contain data corresponding
to unselected taxa.
Status bar
The left sub-panel shows the name of the statistic for the currently selected value. In the next panel, the
status bar shows the taxa-pair name for the selected value.
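
The lower-left triangular layout and the display precision setting can be mimicked outside MEGA when post-processing exported distances. The following is a minimal sketch, not MEGA's export code: the function name format_lower_triangle and the dictionary keyed by frozenset pairs are assumptions chosen for illustration.

```python
def format_lower_triangle(names, dist, precision=3):
    """Render pairwise distances as a lower-left triangular text matrix.

    names: taxa in display order.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    precision: number of decimal places shown for each distance.
    """
    rows = []
    for i, name in enumerate(names):
        # Row i holds the distances to all taxa listed before it.
        cells = [f"{dist[frozenset((name, names[j]))]:.{precision}f}" for j in range(i)]
        rows.append(" ".join([f"{i + 1}. {name}"] + cells))
    return "\n".join(rows)
```

Lowering the precision argument by one corresponds to one click of the decrease-precision button described above.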

Average Menu (in Distance Matrix Explorer)

With this menu, you can compute the following average values:
Overall: Computes and displays the overall average.
Within groups: This item is enabled only if at least one group is defined. For each group, an arithmetic
average is computed for all valid pairwise comparisons and the results are displayed in the Distance Matrix
Explorer. All incalculable within-group averages are shown with an “n/c” in red.
Between Groups: This item is enabled only if at least two groups of taxa are defined. For each
between-group average, an arithmetic average is computed for all valid inter-group pairwise comparisons and results
are displayed in the Distance Matrix Explorer. All incalculable between-group averages are shown with an
“n/c” in red.
Net Between Groups: This item is enabled only if at least two groups of taxa are defined. It
computes net average distances between groups of taxa. This value is given by
dA = dXY – (dX + dY)/2
where dXY is the average distance between groups X and Y, and dX and dY are the mean within-group
distances. You must have at least two groups of taxa with a minimum of two taxa each for this option to
work. All incalculable net between-group averages are shown with a red “n/c”.
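
The three group averages described above can be written out directly. This is a minimal sketch that assumes every pairwise distance is available; the function names and the use of None to stand in for “n/c” are illustrative choices, not MEGA's internals.

```python
from itertools import combinations, product


def mean(values):
    """Arithmetic mean, or None (standing in for "n/c") if there are no values."""
    values = list(values)
    return sum(values) / len(values) if values else None


def within_group(group, dist):
    """Average distance over all pairs of taxa inside one group."""
    return mean(dist[frozenset(pair)] for pair in combinations(group, 2))


def between_groups(group_x, group_y, dist):
    """Average distance over all inter-group pairs (one taxon from each group)."""
    return mean(dist[frozenset((a, b))] for a, b in product(group_x, group_y))


def net_between_groups(group_x, group_y, dist):
    """Net average distance: dA = dXY - (dX + dY) / 2."""
    d_xy = between_groups(group_x, group_y, dist)
    d_x = within_group(group_x, dist)
    d_y = within_group(group_y, dist)
    if d_xy is None or d_x is None or d_y is None:
        return None  # incalculable, e.g. a group with fewer than two taxa
    return d_xy - (d_x + d_y) / 2
```

For groups X = {a, b} and Y = {c, d} with within-group distances 0.1 and 0.3 and all inter-group distances equal to 0.5, the net distance is 0.5 – (0.1 + 0.3)/2 = 0.3. A group with a single taxon has no within-group pairs, so the net distance is incalculable.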

Display Menu (in Distance Matrix Explorer)

The display menu consists of five main commands:


Show Pair Name: This is a toggle to write or hide the name of the taxa pair highlighted, which is displayed
in the status bar below.
Sort Taxa: This provides a submenu for sorting the order of taxa in one of three ways: by input order, by
taxon name or by group name.
Show Names: This is a toggle for displaying or hiding the taxa name.
Show Group Names: This is a toggle for displaying or hiding the group name next to the name of each
taxon, when available.
Change Font: This brings up the dialog box that allows you to choose the type and size of the font for
displaying the distance values.

File Menu (in Distance Matrix Explorer)


The file menu consists of four commands:
Show Input Data Title: This displays the title of the input data.
Show Analysis Description: This displays various options used to calculate the quantities displayed in the
Matrix Explorer.
Export/Print Distances: This brings up a dialog box for writing pairwise distances as a text file, CSV, or
Excel, with a choice of several formats.
Quit Viewer: This exits the Distance Matrix Explorer.

Sequence Data Explorer


The Sequence Data Explorer shows the aligned sequence data. You can scroll along the alignment using the scrollbar
at the bottom right hand side of the explorer window. The Sequence Data Explorer provides a number of utilities for
exploring the statistical attributes of the data and also for selecting data subsets.

This explorer consists of a number of regions as follows:


Menu Bar
Data menu
Search menu
Display menu
Highlight menu
Statistics menu
Help: This item brings up the help file for the Sequence Data Explorer.
Tool Bar
The tool bar provides quick access to the following menu items:
General Utilities

[icon]: This brings up the Exporting Sequence Data dialog box, which contains options to control how MEGA writes the
output data; available options are Text, MEGA, CSV, and Excel.

[icon]: This brings up the Exporting Sequence Data dialog box and sets the default output format to MEGA.

[icon]: This brings up the Exporting Sequence Data dialog box and sets the default output format to Excel.

[icon]: This brings up the Exporting Sequence Data dialog box and sets the default output format to CSV
(comma-separated values).

[icon]: This brings up the dialog box for setting up and selecting domains and genes.

[icon]: This brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.

[icon]: This toggle replaces the nucleotide/amino acid at a site with the identical symbol (e.g. a dot) if the site contains
the same nucleotide/amino acid as the first sequence.

[icon]: This button provides the facility to translate codons in the sequence data into amino acid sequences and
back. All protein-coding regions will be automatically identified and translated for display. When the translated
sequence is already displayed, then issuing this command displays the original nucleotide sequences (including all
coding and non-coding regions). Depending on the data displayed (translated or nucleotide), relevant menu options in
the Sequence Data Explorer become enabled. Note that the translated/un-translated status in this data explorer does
not have any impact on the options for analysis available in MEGA (e.g., Distances or Phylogeny menus),
as MEGA provides all possible options for your dataset at all times.

Highlighting Sites
C: If this button is pressed, then all constant sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
V: If this button is pressed, then all variable sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
Pi: If this button is pressed, then all parsimony-informative sites will be highlighted. A count of the highlighted sites
will be displayed on the status bar.
S: If this button is pressed, then all singleton sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
L: If this button is pressed, then all labelled sites will be highlighted and a count of highlighted sites will be displayed
on the status bar (see also labelled sites).
0: If this button is pressed, then sites will be highlighted only if they are zero-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
2: If this button is pressed, then sites will be highlighted only if they are two-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
4: If this button is pressed, then sites will be highlighted only if they are four-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
Special: This dropdown allows for the selection of a special highlighting option.
CpG/TpG/CpA: if this button is pressed, then all sites which have a C followed by a G, T by G, or C by A will be
highlighted. You may also select a percentage of sequences which must have these properties for a site to be counted.
Coverage: if this button is pressed, then you will enter a percentage. All the sites with this percentage or less of
ambiguous sites will be highlighted.
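
The C, V, Pi, and S categories can be sketched for a single alignment column as follows. This assumes the commonly used definitions (a parsimony-informative site has at least two states that each occur in at least two sequences; a singleton site is variable but not parsimony-informative) and that the column contains no gaps, missing data, or ambiguity codes. How MEGA treats those cases is not covered here, and the function name classify_site is hypothetical.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column (one character per sequence).

    Returns a set of labels: {"constant"}, or {"variable"} together with
    either "parsimony-informative" or "singleton".
    Assumption: the column holds only unambiguous residues (no gaps).
    """
    counts = Counter(column.upper())
    if len(counts) == 1:
        return {"constant"}
    # Number of distinct states that occur in two or more sequences.
    repeated_states = sum(1 for n in counts.values() if n >= 2)
    if repeated_states >= 2:
        return {"variable", "parsimony-informative"}
    return {"variable", "singleton"}


def count_highlighted(sequences, label):
    """Count sites carrying the given label across equal-length sequences."""
    return sum(1 for column in zip(*sequences) if label in classify_site("".join(column)))
```

The count returned by count_highlighted corresponds to the number shown on the status bar when the matching button is pressed, under the assumptions stated above.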

[icons]: These buttons allow you to quickly navigate between highlighted sites by jumping to the previous or next
highlighted site.
Searching

[icon]: This button allows you to specify a sequence name to find. Search results are bolded and the row is highlighted
blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names starting with
what you provided. If no names start with the provided search term, MEGA looks for your search term anywhere
in the names (rather than just at the start).

[icon]: This button allows you to specify a motif to search for in the sequence data. The motif supports IUPAC codes
such as R (for A or G) and Y (for T or C). MEGA highlights (in yellow) the first instance of this motif it finds.

[icons]: These buttons are only enabled if you have already searched for a sequence name or motif.
Clicking the forward or backward button makes MEGA search for the next or previous search result (assuming there is
more than one possible match).
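
The two searches described above can be sketched in a few lines. This is an illustration only: the function names find_name and find_motif are hypothetical, case-insensitive matching is an assumption, and the IUPAC table below is the standard nucleotide ambiguity code set rather than anything taken from MEGA's source.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_name(names, query):
    """Return the index of the first match, trying exact, prefix, then substring."""
    q = query.lower()  # assumption: matching is case-insensitive
    lowered = [n.lower() for n in names]
    tiers = (
        lambda n: n == q,          # 1. exact match
        lambda n: n.startswith(q),  # 2. name starts with the query
        lambda n: q in n,           # 3. query appears anywhere in the name
    )
    for matches in tiers:
        for index, name in enumerate(lowered):
            if matches(name):
                return index
    return None


def find_motif(sequence, motif):
    """Return the 0-based start of the first motif occurrence, or -1."""
    pattern = "".join(IUPAC[code] for code in motif.upper())
    match = re.search(pattern, sequence.upper().replace("U", "T"))
    return match.start() if match else -1
```

For instance, searching the sequence TTGACGT for the motif RCG finds ACG at 0-based position 3, because R stands for A or G.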
The 2-Dimensional Data Grid

Fixed Row: This is the first row in the data grid. It is used to display the nucleotides (or amino acids) in the first
sequence when you have chosen to show their identity using a special character. For protein coding regions, it also
clearly marks the first, second, and the third codon positions.
Fixed Column: This is the first and the leftmost column in the data grid. It is always visible, even when you are
scrolling through sites. The column contains the sequence names and an associated check box. You can check or
uncheck this box to include or exclude a sequence from analysis. Also in this column, you can drag-and-drop
sequences to sort them.
Rest of the Grid: Cells to the right of the first column and below the first row contain the nucleotides or amino acids of the input
data. Note that all cells are drawn in light color if they contain data corresponding to unselected sequences or genes
or domains.
Status Bar
This section displays the location of the focused site and the total sequence length. It also shows the site label, if any,
and a count of the highlighted sites.

Data Menu (Sequence Data Explorer)


This menu provides commands for working with selected data in the Sequence Data Explorer.
The commands in this menu are:
Write Data to File: Brings up the Exporting Sequence Data dialog box.
Translate/Untranslate: Translates protein-coding nucleotide sequences into protein sequences, and back to
nucleotide sequences.
Select Genetic Code Table: Brings up the Select Genetic Code dialog box, in which you can select, edit or add a
genetic code table.
Setup/Select Genes and Domains: Brings up the Sequence Data Organizer, in which you can define and edit genes
and domains.
Setup/Select Taxa and Groups: Brings up the Setup/Select Taxa & Groups dialog, in which you can
edit taxa and define groups of taxa.
Quit Data Viewer: Takes the user back to the main interface.

Translate/Untranslate (in Sequence Data Explorer)


Data | Translate/Untranslate
This command is available only if the data contain protein-coding nucleotide sequences. It automatically extracts all
protein-coding domains for translation and displays the corresponding protein sequence. If the translated sequence is
already displayed, then issuing this command displays the original nucleotide sequences, including all coding and
non-coding regions. Depending on the data displayed (translated or nucleotide), relevant menu options in the
Sequence Data Explorer are enabled. However, translated and un-translated status does not have any impact on the
analytical options available in MEGA (e.g., Distances or Phylogeny menus), as MEGA provides all possible options
for your dataset at all times.
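
Translation of a protein-coding sequence can be sketched as a codon-by-codon lookup. The example below uses the standard genetic code only and is not MEGA's implementation: other code tables (chosen in MEGA via Select Genetic Code Table) are not handled, and the choices to show a fully gapped codon as "-", any other unrecognized codon as "?", and to ignore a trailing partial codon are assumptions.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_sequence):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    seq = coding_sequence.upper().replace("U", "T")
    protein = []
    # Walk complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon == "---":
            protein.append("-")  # assumption: a gapped codon stays a gap
        else:
            protein.append(STANDARD_CODE.get(codon, "?"))
    return "".join(protein)
```

As an example, ATGGCCTAA translates to M, A, and a stop symbol (*).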

Select Genetic Code (in Sequence Data Explorer)


Data | Select Genetic Code Table
Select Genetic Code Table can be invoked from within the Data menu in Sequence Data Explorer, and is also
available in the main interface directly in the Data Menu.

Setup/Select Taxa & Groups (in Sequence Data Explorer)


Data | Setup/Select Taxa & Groups
Setup/Select Taxa & Groups can be invoked from within the Data menu in Sequence Data Explorer, and is
also available in the main interface directly in the Data Menu.

Setup/Select Genes & Domains (Sequence Data Explorer)


Data | Setup/Select Genes & Domains
Setup/Select Genes & Domains can be invoked from within the Data menu in Sequence Data Explorer, and is also
available in the main interface directly in the Data Menu.

Export Data (Sequence Data Explorer)


Data | Export Data
The Exporting Sequence Data dialog box first displays an edit box for entering a title for the sequence data being
exported. The default name is the original name of the data set, if there was one. Below the title is a space for
entering a brief description of the data set being exported.
Next is the option for determining the format of the data set being exported; MEGA currently allows the user to export
the data in MEGA, PAUP 3.0 and PAUP 4.0 (Nexus, Interleaved in both cases), and PHYLIP 3.0 (Interleaved). At the
end of each line is “Writing site numbers.” The three options available are to not write any number, to write one for
each site, or to write the site number of the last site.
Other options in this dialog box include the number of sites per line, which codon position(s) is to be used and
whether non-coding regions should be included, and whether the output is to be interleaved. For missing or
ambiguous data and alignment gaps, there are four options: include all such data, exclude all such data, exclude or
include sites with missing or ambiguous data only, and exclude sites with alignment gaps only.

Quit Data Viewer

Data | Quit Data Viewer


This command closes the Sequence Data Explorer, and takes the user back to main interface.

Display Menu (in Sequence Data Explorer)

This menu provides commands for adjusting the display of DNA and protein sequences in the grid.
The commands in this menu are:
Show only selected sequences: To work only in a subset of the sequences in the data set, use the check
boxes to select the sequences of interest.
Use Identical Symbol: If a site contains the same nucleotide (amino acid) as appears in the first sequence
in the list, this command replaces the nucleotide (amino acid) symbol with a dot (.). If you uncheck this
option, the Sequence Data Explorer displays the single letter code for the nucleotide (amino acid).
Color Cells: This option displays the sequences such that consecutive sites with the same nucleotide (amino
acid) have the same background color.
Select Color: This option changes the color for highlighted sites. It is Yellow by default.
Sort Sequences: The sequences in the data set can be sorted based on several options: sequence names,
group names, group and sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.
Restore input order: This option resets any changes in the order of the displayed sequences (due to sorting,
etc.) back to that in the input data file.

Show Sequence Name: The name of the sequences can be displayed or hidden by checking or unchecking
this option. If the sequences have been grouped, then unchecking this option causes only the group name to
be retained. If no groups have been made, then no name is displayed.
Show Group Name: This option can be used to display or hide group names if the taxa have been
categorized into groups.
Change Font: Brings up the Font dialog box, allowing the user to choose the type, style, size, etc. of the
font to display the sequences.

Restore Input Order

Display | Restore Input Order


Choosing this restores the order in Sequence Data Explorer to that in the input text file.

Show Only Selected Sequences

Display | Show only Selected Sequences


The check boxes in the left column of the display grid can be used to select or deselect sequences for
analysis. Subsequent use of the “Show Only Selected Sequences” option in the Display menu of Sequence
Data Explorer hides all the deselected sequences and displays only the selected ones.

Color Cells
Display | Color cells
This command colors individual cells in the two-dimensional display grid according to the nucleotide or
amino acid it contains. A list of default colors, based on the biochemical properties of the residues, is given
below. In a future version, these colors will be customizable by the user.

For DNA sequences:


Symbol  Color
A       Yellow
G       Fuchsia
C       Olive
T       Green
U       Green

For amino acid sequences:


Symbol  Color     Symbol  Color
A       Yellow    M       Yellow
C       Olive     N       Green
D       Aqua      P       Blue
E       Aqua      Q       Green
F       Yellow    R       Red
G       Fuchsia   S       Green
H       Teal      T       Green
I       Yellow    V       Yellow
K       Red       W       Green
L       Yellow    Y       Lime

Use Identical Symbol
Display | Use Identical Symbol
Data that contain multiple aligned sequences may be easier to view if, when the nucleotide (amino acid) is
the same as that in the corresponding site in the first sequence, the nucleotide (amino acid) is replaced by a
dot. Choosing this option again brings back the nucleotide (amino acid) single-letter codes.
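
The effect of this option can be sketched as a comparison of every sequence against the first one. This is an illustration only: the function name use_identical_symbol is hypothetical, the sequences are assumed to be aligned and of equal length, and whether MEGA also replaces matching gap characters is not addressed here.

```python
def use_identical_symbol(sequences, symbol="."):
    """Replace residues identical to the first sequence with a symbol.

    sequences: aligned sequences of equal length; the first is the reference
    and is always shown in full.
    """
    reference = sequences[0]
    shown = [reference]
    for seq in sequences[1:]:
        shown.append("".join(symbol if c == r else c for c, r in zip(seq, reference)))
    return shown
```

For the alignment ACGT, ACGA, TCGT this gives ACGT, ...A, and T..., which makes the two differing sites easy to spot.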

Show Sequence Names


Display | Show Sequence Names
This option displays the full sequence names in Sequence Data Explorer.

Show Group Names


Display | Show Group Names
This option displays the full group names in Sequence Data Explorer if the sequences have been grouped
in Select/Edit Taxa Groups.

Change Font...
Display | Change Font…
This command brings up the Change Font dialog box, which allows you to change the display font,
including font type, style and size. Options to strike out or underline selected parts of the sequences are also
available. There is also an option for using different scripts, although the only option currently available is
“Western”. Finally, the “Sample” window displays the effects of your choices.

Sort Sequences
Display | Sort Sequences
The sequences in the data set can be sorted based on several options: sequence name, group name, group and
sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.

Sort Sequences by Group Name


Display | Sort Sequences | By Group Name
Sequences that have been grouped in Select/Edit Taxa Groups can be sorted by the alphabetical order
of group names or numerical order of group ID numbers. If the group names contain both a name and a
number, the numerical order will be nested within the alphabetical order.

Sort Sequences by Group and Sequence Names


Display | Sort Sequences | By Group and Sequence Names
Sequences that have been grouped in Select/Edit Taxa Groups can be sorted by the alphabetical order
of group names or the numerical order of group ID numbers. If the group names contain both a name and a
number, the numerical order is nested within the alphabetical order. The sequences can be further arranged
by sorting the sequence names within the group names.

Sort Sequences As per Taxa/Group Organizer


Display | Sort Sequences | As per Taxa/Group Organizer
The sequence/group order seen in Select/Edit Taxa Groups is initially the same as the order in the input text
file. However, this order can be changed by dragging-and-dropping. Choose this option if you wish to see
the data in the same order in the Sequence Data Explorer as in Select/Edit Taxa Groups.

Sort Sequences By Sequence Name


Display | Sort Sequences | By Sequence Name
The sequences are sorted by the alphabetical order of sequence names or the numerical order of sequence ID
numbers. If the sequence names contain both a name and a number, then the sorting is done with the
numerical order nested within the alphabetical order.
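
The idea of numerical order nested within alphabetical order can be illustrated with a short Python sketch. The helper name name_sort_key and the sample names are hypothetical, and the case-insensitive comparison is an assumption rather than documented MEGA behavior.

```python
import re


def name_sort_key(name):
    """Split a name into text and number chunks so digit runs compare as integers."""
    chunks = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    return [int(c) if c.isdigit() else c.lower() for c in chunks]


# Sorting by sequence name: "seq2" comes before "seq10".
names = ["seq10", "seq2", "alpha1", "seq1"]
print(sorted(names, key=name_sort_key))

# Sorting by group name first, then by sequence name within each group.
records = [("groupB", "seq2"), ("groupA", "seq10"), ("groupA", "seq9")]
print(sorted(records, key=lambda r: (name_sort_key(r[0]), name_sort_key(r[1]))))
```

A plain alphabetical sort would place "seq10" before "seq2"; splitting out the digits avoids that.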

Highlight Menu (in Sequence Data Explorer)


This menu can be used to highlight certain types of sites. The options are constant sites, variable
sites, parsimony-informative sites, singleton sites, and 0-fold, 2-fold and 4-fold degenerate sites.

Highlight Conserved Sites


Highlight | Conserved Sites
Use this command to highlight constant sites.

Highlight Variable Sites


Highlight | Variable Sites
Use this command to highlight variable sites.

Highlight Singleton Sites


Highlight | Singleton Sites
Use this command to highlight singleton sites.

Highlight Parsimony Informative Sites
Highlight | Parsim-Info Sites
Use this command to highlight parsimony-informative sites.
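
The four site categories above can be sketched as a small classifier. The definitions used here are the commonly used ones (a variable site has at least two kinds of residues; it is parsimony-informative if at least two of them occur at least twice, and a singleton site otherwise). The function names, and the choice to ignore gap and missing-data symbols, are assumptions made for illustration.

```python
from collections import Counter


def classify_site(column, ignore="-?"):
    """Label one alignment column; every non-constant site counts as variable."""
    counts = Counter(ch.upper() for ch in column if ch not in ignore)
    if len(counts) <= 1:
        # Assumption: a column with no usable residues is also reported as constant.
        return "constant"
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if repeated >= 2 else "singleton"


def classify_alignment(alignment):
    """Classify every column of an alignment given as equal-length strings."""
    return [classify_site(column) for column in zip(*alignment)]


# Hypothetical four-sequence alignment with three sites.
print(classify_alignment(["AAA", "AAC", "ATA", "ATA"]))
```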

Highlight 0-fold Degenerate Sites


Highlight | 0-fold Degenerate Sites
Use this command to highlight 0-fold degenerate sites.

Highlight 2-fold Degenerate Sites


Highlight | 2-fold Degenerate Sites
Use this command to highlight 2-fold degenerate sites. The command is visible only if the data consists of
nucleotide sequences.

Highlight 4-fold Degenerate Sites


Highlight | 4-fold Degenerate Sites
Use this command to highlight 4-fold degenerate sites. The command is visible only if the data consists of
nucleotide sequences.
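
As a rough illustration of degeneracy, the sketch below counts how many of the three possible changes at a codon position leave the amino acid unchanged under the standard genetic code: none gives a 0-fold site, one a 2-fold site, and all three a 4-fold site. The function names are hypothetical, and positions with exactly two synonymous changes (such as the third position of isoleucine codons) are reported as "other" because programs differ in how they treat them; MEGA's own rule is not assumed here.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_changes(codon, pos):
    """Count single-base changes at pos (0, 1 or 2) that keep the amino acid."""
    amino = CODE[codon]
    return sum(
        1
        for base in BASES
        if base != codon[pos] and CODE[codon[:pos] + base + codon[pos + 1:]] == amino
    )


def degeneracy(codon, pos):
    """Map the number of synonymous changes to a degeneracy label."""
    labels = {0: "0-fold", 1: "2-fold", 3: "4-fold"}
    return labels.get(synonymous_changes(codon, pos), "other")


print(degeneracy("GGT", 2))  # third position of a glycine codon
print(degeneracy("GGT", 1))  # second position of the same codon
```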

Statistics Menu (in Sequence Data Explorer)

Various summary statistics of the sequences can be computed and displayed using this menu. The
commands are:
Nucleotide Composition
Nucleotide Pair Frequencies
Codon Usage
Amino Acid Composition
Use All Selected Sites
Use only Highlighted Sites. Sites can be selected according to various criteria (see Highlight Sites), and
analysis can be performed only on the chosen subset of sites.
Display results in Excel (XL) - Only affects outputs from the Statistics menu
Display results in Comma-Delimited (CSV) - Only affects outputs from the Statistics menu
Display results in Text Editor - Only affects outputs from the Statistics menu

Nucleotide Composition

Statistics | Nucleotide Composition


This command is visible only if the data consist of nucleotide sequences. MEGA computes the base
frequencies for each sequence as well as an overall average. These will be displayed by domain in a Text
Editor (if the domains have been defined in Setup/Select Genes & Domains).
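
A minimal sketch of the calculation follows. The function names and the sample data are hypothetical, and the "Avg." row is computed here by pooling all sequences, which is an assumption about how the overall average is defined.

```python
from collections import Counter


def base_frequencies(seq, bases="TCAG"):
    """Percent frequency of each base, ignoring gaps and ambiguous symbols."""
    counts = Counter(ch for ch in seq.upper().replace("U", "T") if ch in bases)
    total = sum(counts.values())
    return {b: (100.0 * counts[b] / total if total else 0.0) for b in bases}


def composition_table(named_seqs):
    """Per-sequence frequencies plus a pooled 'Avg.' row."""
    rows = {name: base_frequencies(seq) for name, seq in named_seqs.items()}
    rows["Avg."] = base_frequencies("".join(named_seqs.values()))
    return rows


# Hypothetical data used only for illustration.
for name, freqs in composition_table({"seq1": "AACG", "seq2": "TTCG"}).items():
    print(name, freqs)
```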

Nucleotide Pair Frequencies

Statistics | Nucleotide Pair Frequencies


This command is visible only if the data consist of nucleotide sequences. There are two options available:
one in which the nucleotide pairs are counted bidirectionally site-by-site for the two sequences (giving
rise to 16 different nucleotide pairs), and the other in which the pairs are counted unidirectionally (10 nucleotide
pairs). MEGA will compute the frequencies of these quantities for each sequence as well as an overall
average. They will be displayed by domain (if domains have been defined in Setup/Select Genes
& Domains).
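
The difference between the 16-pair and 10-pair counts can be seen in a short sketch. With ordered pairs, A in the first sequence against C in the second is distinct from C against A, giving 4 x 4 = 16 categories; with unordered pairs the two are merged, leaving 4 identical pairs plus 6 mixed pairs. The function name pair_counts and its ordered parameter are hypothetical.

```python
from collections import Counter


def pair_counts(seq1, seq2, ordered=True, bases="ACGT"):
    """Count site-by-site nucleotide pairs between two aligned sequences."""
    counts = Counter()
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in bases and b in bases:  # skip gaps and ambiguous symbols
            pair = (a, b) if ordered else tuple(sorted((a, b)))
            counts[pair] += 1
    return counts


# Hypothetical pair of aligned sequences.
print(pair_counts("ACGT", "CAGT", ordered=True))
print(pair_counts("ACGT", "CAGT", ordered=False))
```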

Codon Usage

Statistics | Codon Usage


This command is visible only if the data contain protein-coding nucleotide sequences. MEGA computes
the percent codon usage and the RSCU (relative synonymous codon usage) values for each codon for all
sequences included in the dataset. Results will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).
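
Relative synonymous codon usage compares the observed count of a codon with the count expected if all synonymous codons for that amino acid were used equally. The sketch below assumes the standard genetic code and an in-frame sequence; the function name rscu is hypothetical, and this is an illustration of the formula rather than MEGA's implementation.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(coding_seq):
    """RSCU = observed count * number of synonymous codons / total for the amino acid."""
    seq = coding_seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODE)  # skip codons with gaps or ambiguity
    families = defaultdict(list)
    for codon, amino in CODE.items():
        families[amino].append(codon)
    values = {}
    for members in families.values():
        total = sum(counts[c] for c in members)
        for c in members:
            values[c] = counts[c] * len(members) / total if total else 0.0
    return values


# Hypothetical sequence: two TTT codons and one TTC codon (both phenylalanine).
usage = rscu("TTTTTTTTC")
print(round(usage["TTT"], 3), round(usage["TTC"], 3))
```

A value above 1 means the codon is used more often than expected under equal usage, and a value below 1 means it is used less often.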

Amino Acid Composition

Statistics | Amino acid Composition


This command is visible only if the data consists of amino acid sequences or if the translated protein coding
nucleotide sequences are displayed. MEGA will compute the amino acid frequencies for each sequence as
well as an overall average, which will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).

Use All Selected Sites

Statistics | Use All Selected Sites


Analysis is conducted on all sites in the sequences, irrespective of whether any sites have been labeled or
highlighted.

Use only Highlighted Sites

Statistics | Use only Highlighted Sites


Sites can be selected according to various criteria (see Highlight Sites), and analyses will be performed only
on the chosen subset of sites. All statistical attributes will be based on these sites.

Tree Explorer

Phylogeny | Any tree-building option


The Tree Explorer displays the evolutionary tree based on the options used to compute or display the
phylogeny. The main menu of the Tree Explorer has the following items:
File Menu
Image Menu
Sub-tree Menu
View Menu
Compute Menu

Information Box

The information box in the Tree Explorer lists the various statistical attributes of the displayed tree with
the branch or node highlighted. It usually contains multiple tabs.
General: This reminds the user of the number of taxa (and groups, if any) and of the strategy used to deal
with gaps and missing data.
Tree: This contains information about the type of tree (rooted or unrooted) and the sum of branch lengths
(SBL), or tree length. In addition, the total number of trees and the number of the current tree are
displayed.
Branch: In the Tree Explorer window you may click on a branch or on a node of the tree. If you click on a
branch, this tab displays its location in terms of the two nodes it connects. (Leaf taxa are numbered in the
order in which they appear in the input data file.) This window also displays the length of the selected
branch. If you click on a node, the internal identification number of that node is displayed.

File Menu (in Tree Explorer)

This menu has the following options:


Save Current Session: This brings up the Save As dialog box and saves all the information currently held
by the Tree Explorer to a file in a binary format. This feature allows you to retrieve the current Tree
Explorer session for tree manipulation and printing.
Export Current Tree (Newick): This writes the topology of the current tree in the MEGA tree format to a
specified file. Note that only the branching pattern is stored.
Export Current Tree (Time Tree): For Time Trees, this writes the tree in a tabular format which includes
relative divergence times and standard errors, relative rates, and absolute divergence times if calibrations were
used when constructing the tree.
Export Current Calibrations (Time Tree): For Time Trees, exports divergence time calibration
constraints that were provided for generating a Time Tree.
Export All Trees (Newick): This writes the topologies of all trees in the MEGA tree format to a specified
file. Note that only the branching pattern is stored.
Export Analysis Summary: Export a text file that has the analysis settings and data details for the
constructed tree.
Export Partition List: Export a text file that gives the frequency of occurrence for partitions found during
the bootstrapping process for tree construction with the bootstrap test.
Export Pairwise Distances: If a timetree is displayed this option exports a pairwise matrix of times of
divergence for all taxa in the tree. If any other tree is shown, this option exports a pairwise matrix of patristic
distances for all taxa in the tree.
Write Tree in a Table Format: Export the tree to a text file that shows parent/child relationships in a
tabular format and includes branch lengths if they are available.
Show Info: This brings up the Information dialog box.
Print: This brings up the Print dialog box and prints the current tree in the displayed size; if the displayed
tree is larger than the page size, it will be printed on multiple pages.
Print in a sheet: This brings up the Print dialog box and prints the current tree, after restricting the size of
the printed tree to one sheet. The current tree also can be printed using the button on the toolbar.
Printer Setup: This allows the user to set up the printer.
Quit Tree Explorer: This exits the Tree Explorer.
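
A patristic distance, as exported by the Export Pairwise Distances option, is the sum of the branch lengths on the path connecting two taxa. The sketch below computes it from a parent/child table of the kind a tabular tree export might contain. The dictionary layout, the node names, and the function names are all hypothetical and do not reflect MEGA's file format.

```python
def path_to_root(node, parent):
    """Return {ancestor: distance from node} for the node and all of its ancestors."""
    distances = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        distances[node] = total
    return distances


def patristic_distance(a, b, parent):
    """Sum of branch lengths on the path between taxa a and b."""
    from_a = path_to_root(a, parent)
    from_b = path_to_root(b, parent)
    shared = set(from_a) & set(from_b)
    # With non-negative branch lengths, the closest shared ancestor gives the shortest total.
    return min(from_a[n] + from_b[n] for n in shared)


# Hypothetical tree: child -> (parent, branch length).
tree = {
    "A": ("n1", 0.1),
    "B": ("n1", 0.2),
    "n1": ("root", 0.05),
    "C": ("root", 0.3),
}
print(patristic_distance("A", "B", tree))
print(patristic_distance("A", "C", tree))
```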

Image Menu (in Tree Explorer)

The image menu contains the following options:


Copy to Clipboard: This copies the tree image to the clipboard, which can also be done by simultaneously
pressing Ctrl and ‘C’ keys. You then can paste the copied image into any other application (e.g., PowerPoint or
Word).
Save as BMP: This option saves the tree image as a Windows bitmap (BMP) file.
Save as PNG: This option saves the tree image as a Portable Network Graphics (PNG) file.
Save as PDF: This option saves the tree image as a Portable Document Format (PDF) file.
Save as SVG: This option saves the tree image as a Scalable Vector Graphics (SVG) file.
Save as TIFF: This option saves the tree image as a TIFF file.
Save as EMF: This option (only available on Windows) saves the image as an enhanced windows metafile (.EMF).
Load Taxon Images: This option automatically associates images with each taxon. To use it, you will be prompted for
the directory where the bitmap images (in BMP format) reside. For each taxon, the image file must have a BMP
extension and the filename must be identical to the taxon name displayed in the Tree Explorer. All of the valid
images that are found will be retrieved and displayed.
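
Before loading taxon images, it can help to check which taxa have a matching file. The sketch below lists the BMP files in a directory and compares their base names with the taxon names. The function name match_taxon_images is hypothetical, and exact, case-sensitive matching of the name is an assumption.

```python
from pathlib import Path


def match_taxon_images(taxon_names, image_dir):
    """Return (found, missing): taxa with a same-named .bmp file, and those without."""
    available = {
        p.stem: p
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".bmp"
    }
    found = {name: available[name] for name in taxon_names if name in available}
    missing = [name for name in taxon_names if name not in available]
    return found, missing
```

For example, calling match_taxon_images(["Homo sapiens", "Pan troglodytes"], "images") on a hypothetical images folder containing only "Homo sapiens.bmp" would report "Pan troglodytes" as missing.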

Subtree Menu (in Tree Explorer)

This menu contains the tree manipulation options Swap, Flip and Compress/Expand. In addition, by
clicking on the corresponding items in the menu (for which there are tool buttons on the left), you can
specify the root of the tree, and display a subtree (a portion of the tree defined by a given internal branch) in
a separate window.
Many of these functionalities are also available through tools in the toolbar on the left side of the displayed
tree.

Subtree Drawing Options (in Tree Explorer)

This dialog box provides options for changing various visual attributes for the selected subtree. If
the Overwrite Downstream option is checked, any subtree drawing options that have been applied to
downstream nodes within the current subtree will be overwritten.
Property Tab:
Name/Caption: This section allows you to provide an alphanumeric caption for the selected node.
Node/Subtree Marker: This section provides elements for changing the shape and color of the selected
subtree node marker. If the Apply to Taxon Markers option is checked, the selected shape and color options
will be applied to all taxon markers contained within the subtree.
Branch Line: This section provides various drawing options that will be applied to the branch lines of the
selected subtree.
Display Tab:
Display Caption: If checked, the node caption, if set within the Property Tab, will be displayed.
Display Bracket: If checked, this item will display a bracket that encompasses the selected subtree using
the configured bracket drawing options.
Display Taxon Names: If checked, the taxon names attributed to the leaf nodes will be displayed.
Display Node Markers: If checked, any node markers that were configured within the Property Tab will
be displayed.
Display Taxon Markers: If checked, any taxon markers that were configured within the Property Tab will
be displayed.
Compress Subtree: If checked, the selected subtree will be compressed and rendered as a graphical vector
according to the configured drawing options.
Image Tab:
Display Image: If checked, the Tree Explorer will display an image, if loaded, at the configured position
relative to the subtree node caption text.

Cutoff Values Tab

In this tab, you can specify a cut-off level for the condensed or consensus trees. Appropriate options
become available depending on the trees displayed.

View Menu (in Tree Explorer)

This menu brings up several viewing options:


Topology only: This displays the tree in the form of relationships among the taxa, ignoring
the branch lengths.
Root on Midpoint: This roots the tree on the midpoint of the longest path between two taxa.
Arrange Taxa: This allows you to arrange the taxa in the tree based on the order of taxa in the input data
file or to produce a tree that looks “balanced.”
Tree/Branch Style: This allows you to select the display of the tree in one of three
styles: Traditional, Radiation, or Circle. For Traditional, there are three additional
options: Rectangular, Straight or Curved.
Show/Hide: This allows you to display or hide the following information: taxon label, taxon marker,
statistics (e.g., bootstrap values), branch lengths, node ids, divergence times (for timetrees), data
coverage (for timetrees) or scale bar.
Fonts: This allows you to choose features such as font type and size for information, including the taxon
label, statistics, and scale bar.
Options: This brings up the Option dialog box, which provides control over various aspects of the tree
drawing, including individual branches, the taxon names, and the scale bar.

Options dialog box (in Tree Explorer)

Through this dialog box, you can specify various drawing attributes for the tree. All options are organized
in five tabs.
Tree
Branch
Labels
Scale
Cutoff

Tree tab (in Options dialog box)

This allows you to manipulate aspects of the tree, depending on the style you used to draw the tree. For
instance, if you used the traditional rectangular style, then you can manipulate the taxon separation distance,
branch length, or tree width, in the number of pixels. This tab also contains a schematic of a tree illustrating
these features.

Branch tab (in Options dialog box)

This tab has options for the following aspects of the tree:
Line Width: This allows the user to choose the width of the lines.
Display Statistics/Frequency: This presents the options to Hide or Show the statistics and frequency, to
choose the font, or to alter the placement of the numbers by manipulating the horizontal and vertical
positions.
Display Branch Length: This presents the option to Show the branch length or Hide it if it is shorter than a
specified length, to alter the placement of the written branch lengths, and to choose the number of decimal
places for writing the branch lengths.
Display Divergence Times: This presents the option to Show or Hide divergence times for Time Trees as
well as control formatting of divergence time presentation.

Labels tab (in Options dialog box)

This tab has options for the following:


Display Taxon Names: Presents the option to show (checked) or hide (unchecked) the label and to choose
the font.
Display Markers: Allows you to draw small symbols along with or instead of taxa names in the tree. Two
combo boxes and a list allow you to select the marker graphics and its color.

Scale Bar tab (in Options dialog box)

This tab has the following options:


Line Width: This drop-down menu allows you to choose the width of the line and the font size used in the
scale bar.
Show Distance Scale: This allows you to show or hide the scale bar distance, to enter the unit
used and to choose its length and the interval between tick marks.
Show Time Scale: This presents the option of showing or hiding the divergence time in the scale bar, and to
enter the units used. You also can determine the interval between two major ticks and two minor ticks. To
activate this option the divergence time for a node or the evolutionary rate must be given.

Compute Menu (in Tree Explorer)

This menu makes available various tree computations, including Condensed tree, Time Tree, Consensus
tree, and Calibrate Molecular Clock.

Time Tree Tool

The Timetree tool in the Tree Explorer is used for calculating relative and absolute divergence times for all
branching points in the tree. Using the Timetree tool will produce a time tree with the same topology as the
active tree, where MEGA estimates local clock rates and divergence times for all branching points in the tree
using the RelTime method (see Tamura et al. 2012). When using this tool, all divergence time estimates are
based solely on the branch lengths in the active tree. MEGA provides options to pre-compute branch lengths
(e.g. using the likelihood-based tool) from the Clocks menu on the main MEGA form.

To use the Timetree Tool in Tree Explorer, select Compute | Compute Time Tree (or click the Time Tree
Tool button which looks like a clock). The Timetree Wizard, which specifies the steps for creating a
timetree, will then be displayed.
Once the Time Tree tool is finished, estimated divergence times and local clock rates can be exported to a
text file (File | Export Current Tree (Time Tree)) or viewed in the information window (File | Show
Information).
See also
Time Trees
Time Tree (ML) tutorial
Molecular Clock Test

Tree Topology Editor

The Tree Topology Editor shows a single tree's topology in a form that you can modify. The toolset is very
similar to that of the Tree Explorer, but with some features added and others removed.
The Topology Editor is also used during analyses in which the user supplies their own tree. In some cases
the taxa names in the supplied tree don’t match up exactly with the names in the sequence file being used. In these
cases users will have a chance to fix the inconsistency by mapping the sequence names to the tree names. How
name mapping works is described further below.
Editing the Topology of a Tree
The most basic use of the Topology Editor is to enable the user to build or edit a tree file. The editor can be launched
from the main form by clicking User Tree->Edit/Draw Tree (Manually). If you don’t have a tree file to start off with
then you can choose to either start from scratch or start with a randomly created tree based on your sequence file (this
just saves the time of adding the taxa).

This image shows the Tree Topology Editor with the NJ tree for the Crab_rRNA.meg example file loaded.
Toolbar Explained

- Open a Newick tree file

- Open a recently edited tree. This is especially useful in the case where you have a tree file which doesn’t
completely match a sequence file. If you have to resolve the differences MEGA remembers the tree and the mapping
of the taxa. Just select the recently edited tree next time you need to use it with that sequence file.

- Generate a new tree.

- Save to Newick file format

- Change the taxa font

- Search for a taxon by name

- Copy an image of the tree to the clipboard (you can paste a picture of the tree into Word, or another program)

- Undo the last change (only applies to topology changes, not taxa name changes)

- Add a new taxon (adds on the currently selected branch, or if no branch is selected it adds to the very top
subtree)

- Delete a taxon (If a taxon is selected it will be drawn in blue, and this option will become enabled. To select a
taxon, simply click its name once.)

- Place root at selected branch or node

- Swap branches of a subtree

- Resize tree to fit the window. This is especially useful with large trees.

- Resize the tree by dragging (select this option, then click and drag on the tree to resize it). Some larger trees
will take longer to resize simply because of their size.

Resolving taxa name discrepancies between a tree and a sequence file


Some analyses in MEGA will require you to specify a tree file. The tree should be a tree representative of the
sequence data, or at least contain the sequence names as taxa names in it. If the tree taxa names and the sequence data
names don’t match perfectly or the tree is the wrong size then you are given a chance to resolve these
discrepancies. The following is an example of creating an ML tree with an initial tree which doesn’t match the
sequence names.
We first specify a tree file in the analysis preferences.

Next we are told that there is a Taxa Name Mismatch and we are asked how we would like to resolve this. One option
is Automatic Tree, which, if chosen, would simply have MEGA construct a neighbor-joining tree to use as a starting
point for the heuristic search. Instead, select Use Topology Editor so that MEGA will display the Tree Topology
Editor.

This is the dialog where we will map the taxa names from our sequence data file onto the tree. It’s important to
remember that the tree must have the same number of taxa as your sequence data. We will call the sequence data
names Active Data Names. In this example there is 1 extra taxon which will need to be removed at the end.

At this point 4 of the 13 taxa have been mapped. Notice that on the left hand side when a taxon has been mapped it
has an entry associated with it under the Map to User Tree Name column. Mapped taxa also show up in the tree with
black text, and no longer say .
There are two ways to map an active data name to a tree name. The first is simply dragging the active data name from
the left hand side (by clicking and dragging) and dropping it on the tree over the tree name you would like to map it to.
The second way to map taxa is to click on the space in the Map to User Tree Name column which is across from the
Active Data Name you wish to map. This will bring up a selection box where you can click the tree name you want to
associate with it. Below is a screen shot of the second method.

We will now map the rest of the taxa.

Notice that one taxon on the tree isn’t mapped, but there are no “Active Data Names” left to map to it. This is an
extra taxon in the tree, and we will delete it. Simply right-click on the name and select Remove OTU.
Our mapping process is now complete! Click the “OK” button. If you are going to be using this tree file and data set
for a number of analyses, you may want to note the “Recent Trees” feature, which keeps track of these
associations. Next time you will just find the tree under “recent trees” and be done.
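
Before starting an analysis with a user-supplied tree, the names can be compared outside MEGA to see which mismatches will need mapping. The sketch below pulls leaf labels out of a Newick string with a simple regular expression and compares them with a list of sequence names. The function names are hypothetical, and the parser is an assumption that only covers plain or single-quoted labels without comments or escaped quotes.

```python
import re


def newick_taxa(newick):
    """Extract leaf labels: in Newick, a leaf label directly follows '(' or ','."""
    labels = re.findall(r"[(,]\s*('[^']*'|[^'():,;\s]+)", newick)
    return [label.strip("'") for label in labels]


def compare_names(tree_names, data_names):
    """Return (names only in the data, names only in the tree)."""
    tree, data = set(tree_names), set(data_names)
    return sorted(data - tree), sorted(tree - data)


# Hypothetical tree and sequence names used only for illustration.
tree_names = newick_taxa("((A:0.1,B:0.2):0.05,'C d':0.3,E);")
print(tree_names)
print(compare_names(tree_names, ["A", "B", "C_d"]))
```

Names reported only in the data need to be mapped to a tree name, and names reported only in the tree are candidates for removal, as in the walkthrough above.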

Introduction to Alignment Explorer
The Alignment Explorer provides options to (1) view and manually edit alignments and (2) generate
alignments using a built-in CLUSTALW implementation and the MUSCLE program (for the complete sequence
or data in any rectangular region). The Alignment Explorer also provides tools for exploring web-based
databases (e.g., NCBI Query and BLAST searches) and retrieving desired sequence data directly into the
current alignment.
The Alignment Explorer has the following menus in its main
menu: Data, Edit, Search, Alignment, Display, Web, Sequencer, and Help. In addition, there
are Toolbars that provide quick access to many Alignment Explorer functions. The main Alignment Explorer
window contains up to two alignment grids.
For amino acid input sequence data, the Alignment Explorer provides only one view. However, it offers two
views of DNA sequence data: the DNA Sequences grid and the Translated Protein Sequences grid. These
two views are present in alignment grids in the two tabs with each grid displaying the sequence data for the
current alignment. Each row represents a single sequence and each column represents a site. A “*” character
is used to indicate site columns exhibiting consensus across all sequences. An entire sequence may be
selected by clicking on the gray sequence label cell found to the left of the sequence data. An entire site may
be selected by clicking on the gray cell found above the site column. The alignment grid has the ability to
assign a unique color to each unique nucleotide or amino acid and it can display a background color for each
cell in the grid. This behavior can be controlled from the Display menu item found in the main menu. Please
note that when the ClustalW (and MUSCLE) alignment algorithms are initiated, they will only align the sites
currently selected in the alignment grids. Multiple sites may be selected by clicking and then dragging the
mouse within the grid. Note that all of the manual or automatic alignment procedures carried out in the
Protein Sequences grid will be imposed on the corresponding DNA sequences as soon as you flip to the
DNA sequence grid. Even more importantly, the Alignment Explorer provides unlimited UNDO capabilities.
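
The “*” consensus marker mentioned above can be illustrated with a short sketch that marks every column in which all sequences carry the same residue. The function name consensus_line is hypothetical, and leaving all-gap columns unmarked is an assumption.

```python
def consensus_line(alignment, gap="-"):
    """Return a string with '*' under columns where every sequence agrees."""
    marks = []
    for column in zip(*alignment):
        same = len(set(column)) == 1 and column[0] != gap
        marks.append("*" if same else " ")
    return "".join(marks)


# Hypothetical three-sequence alignment.
rows = ["ATG-", "ATC-", "ATG-"]
for row in rows:
    print(row)
print(consensus_line(rows))
```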

You may adjust the width of the sequence name column by clicking on the line which separates the sequence
names column and the start of the data column and dragging.

Aligning Sequences

In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this
tutorial can be found in the MEGA\Examples\ folder (the default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting the Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.
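
To inspect a sequence file before opening it in the Alignment Explorer, a FASTA file can be read with a few lines of Python. This sketch assumes that a .fas file is plain FASTA text (a ">" header line followed by sequence lines); the function name read_fasta and the sample records are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into an ordered {name: sequence} dictionary."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical records; for a real file, pass an open file handle instead of a list.
sample = [">seq1", "ATGGCC", "TTA", ">seq2", "ATGGCA"]
for name, seq in read_fasta(sample).items():
    print(name, len(seq))
```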

Aligning Sequences by ClustalW


You can create a multiple sequence alignment in MEGA using either the ClustalW or Muscle algorithms.
Here we align a set of sequences using the ClustalW option.
Example 2.2:
Open the alignment file (using the instructions above) hsp20.fas.
Select the Edit | Select All menu command to select all sites for every sequence in the data set.
Select Alignment | Align by ClustalW from the main menu to align the selected sequence data
using the ClustalW algorithm. Click the “Ok” button to accept the default settings for ClustalW.
Once the alignment is complete, save the current alignment session by selecting Data | Save
Session from the main menu. Give the file an appropriate name, such as "hsp20_Test.mas". This
will allow the current alignment session to be restored for future editing.
Exit the Alignment Explorer by selecting Data | Exit Aln Explorer from the main menu.

Aligning Sequences Using Muscle


Here we describe how to create a multiple sequence alignment using the Muscle option.
Example 2.3:
Starting from the main MEGA window, select Align | Edit/Build Alignment from the launch bar.
Select Create a new alignment and then select DNA.
From the Alignment Explorer window, select Data | Open | Retrieve sequences from a file and
select the “Chloroplast_Martin.meg” file from the MEGA/Examples directory.
On the Alignment Explorer main menu, select Edit | Select All.
On the Alignment Explorer launch bar, you will find an icon that looks like a flexing arm. Click on
it and select Align DNA.
Near the bottom of the MUSCLE - AppLink window, you will see a row called Alignment Info. You
can read information about the Muscle program.
Click on the Compute button (accept the default settings). A Progress window will keep you
informed of Muscle alignment status. In this window, you can click on the Command Line
Output tab to see the command-line parameters which were passed to the Muscle program. Note:
The analysis may complete so fast, that you won’t be able to click on this tab or read it. The
information in this tab isn’t essential, it’s just interesting.
When the Muscle program has finished, the aligned sequences will be passed back to MEGA and
displayed in the Alignment Explorer window.
Close the Alignment Explorer by selecting Data | Exit Aln Explorer. Select No when asked if you
would like to save the current alignment session to file.

Obtaining Sequence Data from the Internet (GenBank)


Using MEGA’s integrated web browser you can fetch GenBank sequence data from the NCBI website if
you have an active internet connection.
Example 2.4:
From the main MEGA window, select Align | Edit/Build Alignment from the main menu.
When prompted, select Create New Alignment and click OK. Select DNA.
Activate MEGA’s integrated browser by selecting Web | Query Genbank from the main menu.
When the NCBI: Nucleotide site is loaded, enter CFS as a search term into the search box at the top
of the screen. Press the Search button.
When the search results are displayed, check the box next to any item(s) you wish
to import into MEGA.
If you have checked more than one box: locate the Display Settings dropdown (located
near the top left hand side of the page directly under the tab headings). Change the value
to FASTA (Text) and click the Apply button. This will output all the sequences you selected
as a text in the FASTA format.
Press the Add to Alignment button (with the red + sign) located above the web address bar. This
will import the sequences into the Alignment Explorer.
With the data now displayed in the Alignment Explorer, you can close the Web Browser window.
Align the new data using the steps detailed in the previous examples.
Close the Alignment Explorer window by clicking Data | Exit Aln Explorer. Select No when asked
if you would like to save the current alignment session to file.
Note: We have aligned some sequences and they are now ready to be analyzed. Whenever you need to
edit/change your sequence data, you will need to open it in the Alignment Editor and edit or align it there.
Then export it to the MEGA format and open the resulting file.

Aligning Coding Sequences via Protein Sequences


MEGA provides two convenient methods for aligning coding sequences based on the alignment of protein
sequences. In order to accomplish this using the first method, you use the Alignment Explorer to load a data
file containing protein-coding sequences. If you click on the Translated Protein Sequences tab you will see
that the protein-coding sequences are automatically translated into their respective protein sequence. With
this tab active select the Alignment|Align by ClustalW menu item or click on the “W” tool bar icon to begin
the alignment of the translated protein sequences. Once the alignment of the translated protein sequences
completes, click on the DNA Sequences tab and you’ll find that Alignment Explorer automatically aligned
the protein-coding sequences according to the aligned translated protein sequences. Any manual adjustments
made to the translated protein sequence alignment will also be reflected in the protein-coding sequence tab.
Using the second method, select the Alignment|Align by ClustalW (Codons) menu item after loading the
sequence data in Alignment Explorer. Optionally, if MEGA detects that active sequence data may be
protein-coding, clicking the “W” tool bar icon will display a drop down menu for selecting either a DNA or
Coding alignment.
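
The idea behind both methods, aligning the translated proteins and then laying the original codons back over that alignment, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MEGA's implementation: the function name thread_codons is hypothetical, the gap character is assumed to be "-", and the coding sequence is assumed to contain exactly one codon per aligned residue (no stop codon, no partial codons).

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Lay the codons of coding_dna over an aligned protein sequence.

    Hypothetical helper: each residue in aligned_protein consumes the next
    codon, and each protein gap becomes a three-base gap.
    """
    if len(coding_dna) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if residues != len(codons):
        raise ValueError("number of residues and codons differ")
    next_codon = iter(codons)
    return "".join(gap * 3 if ch == gap else next(next_codon)
                   for ch in aligned_protein)


# Example: a gap in the protein alignment becomes a codon-sized gap.
print(thread_codons("M-K", "ATGAAA"))
```

Because gaps are always inserted in multiples of three bases, the reading frame of the coding sequences is preserved.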

Toolbars in Alignment Explorer

Basic Functions
This prepares Alignment Builder for a new alignment. Any sequence data currently loaded into Alignment Builder is dis

This activates the Open File dialog window. It is used to send sequence data from a properly formatted file into Alignm

This activates the Save Alignment Session dialog window. It may be used to save the current state of the Alignment Bui

This causes nucleotide sequences currently loaded into Alignment Builder to be translated into their respective amino ac

Web Browser/Data Explorer Functions


This displays the NCBI BLAST web site in the integrated Web Browser window. If a sequence in the sequence grid is s
sequence data.
This displays the default database (GenBank) in the integrated Web Browser window.

This activates the Open Trace File dialog window, which may be used to open and view a sequencer file. The sequence

Alignment Functions
This displays the ClustalW parameters dialog window, which is used to configure ClustalW and initiate the alignment o
appear asking if you would like to select all of the currently loaded sequences.
This displays the MUSCLE parameters dialog window, which is used to configure MUSCLE and initiate the alignment
appear asking if you would like to select all of the currently loaded sequences.
This marks or unmarks the currently selected single site in the alignment grid. Each sequence in the alignment may hav
then aligning them using the Align Marked Sites function.
This button aligns marked sites. Two or more sites must be marked in order for this function to have an effect.

Search Functions
This activates the Find Motif search box. When this box appears, it asks you to enter a motif sequence (a small subsequ
occurrence of the search term and indicates it with yellow highlighting. For example, if you were to enter the motif “AG
highlighted in yellow.
This searches towards the beginning of the current sequence for the first occurrence of the motif search term. If no moti

This searches towards the end of the current sequence for the first occurrence of the motif search term. If no motif searc

This locates the marked site in the current sequence. If no site has been marked, a warning box will appear.

Editing Functions
This undoes the last Alignment Builder action.

This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases, or entire sequen

This removes the current selection from the Alignment Builder and sends it to the clipboard. This function can affect a s

This pastes the contents of the clipboard into the Alignment Builder. If the clipboard contains a block of bases, it will be
they will be added to the current alignment. For example, if the contents of a FASTA file were copied to the clipboard f
This deletes a block of selected bases from the alignment grid.

This deletes gap-only sites (sites containing a gap across all sequences in the alignment grid) from a selected block of b

Sequence Data Insertion Functions


This creates a new, empty sequence row in the alignment grid. A label and sequence data must be provided for this new

This activates an Open File dialog box that allows for the selection of a sequence data file. Once a suitable sequence da
grid.
Site Number display on the status bar
Site # The Site # field indicates the site represented by the current selection. If the w/o Gaps radio button is selected, then the
selected, then this field will contain the site # for the first site in the block. If an entire sequence is selected this field wil

Menu Items

Alignment Menu (in Alignment Explorer)

This menu provides access to commands for editing the sequence data in the alignment grid. The commands
are:
Align by ClustalW: This option is used to align the DNA or protein sequence included in the current
selection on the alignment grid. You will be prompted for the alignment parameters (which are context
sensitive for DNA or Protein sequence data) to be used in ClustalW; to accept the parameters, press “OK”.
This initiates the ClustalW alignment system. Alignment Builder then aligns the current selection in the
alignment grid using the accepted parameters.
Align by ClustalW (Codons): This option is used to align (via ClustalW) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally
replacing the amino acids with the original codons.
Align by MUSCLE: This option is used to align the DNA or protein sequence included in the current
selection on the alignment grid. You will be prompted for the alignment parameters (DNA or Protein) to be
used in MUSCLE; to accept the parameters, press “OK”. This initiates the MUSCLE alignment
system. Alignment Builder then aligns the current selection in the alignment grid using the accepted
parameters.
Align by MUSCLE (Codons): This option is used to align (via MUSCLE) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally
replacing the amino acids with the original codons.
Mark/Unmark Site: This marks or unmarks a single site in the alignment grid. Each sequence in the
alignment may only have one site marked at a time. Modifications can be made to the alignment by marking
two or more sites and then aligning them using the Align Marked Sites function.
Align Marked Sites: This aligns marked sites. Two or more sites in the alignment must be marked for this
function to have an effect.
Unmark All Sites: This item unmarks all currently marked sites across all sequences in the alignment grid.
Delete Gap-Only Sites: This item deletes gap-only sites (site columns containing gaps across all sequences)
from the alignment grid.
Auto-Fill Gaps: If this item is checked, then the Alignment Builder will ensure that all sequences in the
alignment grid are the same length by padding shorter sequences with gaps at the end.
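
As an illustration of the last two commands, the following Python sketch pads shorter sequences with trailing gaps and removes columns that contain only gaps. It is a minimal sketch, not MEGA's code: the function names are hypothetical and the gap character is assumed to be "-".

```python
def pad_with_gaps(seqs, gap="-"):
    """Pad shorter sequences with trailing gaps so all have equal length."""
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, gap) for s in seqs]


def delete_gap_only_sites(seqs, gap="-"):
    """Drop every column in which all sequences have a gap."""
    padded = pad_with_gaps(seqs, gap)
    kept = [col for col in zip(*padded) if any(ch != gap for ch in col)]
    if not kept:
        return ["" for _ in padded]
    return ["".join(row) for row in zip(*kept)]


# Example: columns 2 and 4 contain gaps in every sequence and are removed.
print(delete_gap_only_sites(["A-C-", "A-G"]))
```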

Display Menu (in Alignment Explorer)

This menu provides access to commands that control the display of toolbars in the alignment grid. The commands in
this menu are:
Toolbars: This contains a submenu of the toolbars found in Alignment Explorer. If an item is checked, then its toolbar
will be visible within the Alignment Explorer window.
Columns: This contains a submenu for toggling the display of species names and groups columns. If an item is
checked, then its column will be shown.
Use Colors: If checked, Alignment Explorer displays each unique base using a unique color indicating the base type.
Background Color: If checked, then Alignment Explorer colors the background of each base with a unique color that
represents the base type.
Toggle Conserved Sites: Toggles on/off the display of background color for sites with a given percent of
conservation.
Font: The Font dialog window can be used to select the font used by Alignment Explorer for displaying the sequence
data in the alignment grid.

Edit Menu (in Alignment Explorer)

This menu provides access to commands for editing the sequence data in the alignment grid. The commands in this
menu are:
Undo: This undoes the last Alignment Explorer action.
Copy: This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases, or
entire sequences.
Cut: This removes the current selection from the Alignment Explorer and sends it to the clipboard. This function can
affect a single base, a block of bases, or entire sequences.
Paste: This pastes the contents of the clipboard into the Alignment Explorer. If the clipboard contains a block of bases,
they will be pasted into the builder, starting at the point of the current selection. If the clipboard contains complete
sequences, they will be added to the current alignment. For example, if the contents of a FASTA file are copied from a
web browser to the clipboard, they will be pasted into the Alignment Explorer as a new sequence in the alignment.
Delete: This deletes a block of selected bases from the alignment grid.
Delete Gaps: This deletes gaps from a selected block of bases.
Insert Blank Sequence: This creates a new, empty sequence row in the alignment grid. A label and sequence data
must be provided for this new row.
Insert Sequence From File: This activates an Open File dialog box that allows for the selection of a sequence data
file. Once a suitable sequence data file is selected, its contents will be imported into Alignment Explorer as new
sequence rows in the alignment grid.
Select Site(s): This selects the entire site column for each site within the current selection in the alignment grid.
Select Sequences: This selects the entire sequence for each site within the current selection in the alignment grid.
Select all: This selects all of the sites in the alignment grid.
Allow Base Editing: If this item is checked, the base values in all cells of the alignment grid can be changed. If it is not
checked, then all bases in the alignment grid are treated as read-only.
Modify All Bases to Uppercase: Changes any bases written in lowercase to uppercase.

Data Menu (in Alignment Explorer)

This menu provides commands for creating a new alignment, opening/closing sequence data files, saving alignment
sessions to a file, exporting sequence data to a file, changing alignment sequence properties, reverse complementing
sequences in the alignment, and exiting Alignment Explorer. The commands in this menu are:
Create New Alignment: This tells Alignment Explorer to prepare for a new alignment. Any sequence data currently
loaded into Alignment Builder is discarded.
Open: This submenu provides two options: opening an existing sequence alignment session (previously saved
from Alignment Explorer), and reading a text file containing sequences in one of many formats (including MEGA,
PAUP, FASTA, NBRF, etc.). Based on the option you choose, you will be prompted for the file name that you wish
to read.
Reopen: Displays a list of recently opened files that can be activated in Alignment Explorer.
Close: This closes the currently active data in the Alignment Explorer.

Phylogenetic Analysis: Clicking this item will prepare the data in the active sequence alignment for further analysis
in MEGA so that the alignment does not have to be saved to a file on disk and then reopened for analysis in MEGA.
Save Session: This allows you to save the current sequence alignment to an alignment session. You will be requested
to give a file name to write the data to.
Export Alignment: This allows you to export the current sequence alignment to a file. There are three formats to
choose from: MEGA, FASTA or PAUP/NEXUS formats. You will be requested to give a file name to write the data
to.
DNA Sequences: Use this item to specify that the input data is DNA. If DNA is selected, then all sites are treated as
nucleotides. The Translated Protein Sequences tab contains the protein sequences. If the data is non-coding, then
ignore the second tab, as it has no effect on the DNA Sequences tab. However, any changes you make in
the Translated Protein Sequences tab are applied to the DNA Sequences tab. Note that you can undo these changes by
using the Undo button.
Protein Sequences: Use this item to specify that the input data is amino acid sequences. If selected, then all sites are
treated as amino acid residues.
Translate/Untranslate: This item only will be available if protein-coding DNA sequences are available in the
alignment grid. It will translate protein-coding DNA sequences into their respective amino acid sequences using the
selected genetic code table.
Select Genetic Code Table: This displays the Select Genetic Code dialog window, which can select the genetic code
table that is used when translating protein-coding DNA sequence data.
Reverse Complement: This becomes available when one or more entire sequence rows are selected. It will update the
selected rows to contain the reverse complement of the originally selected sequence(s).
Exit AlnExplorer: This closes the Alignment Explorer window and returns to the main MEGA application
window. When selected, a message box appears asking if you would like to save the current alignment session to a
file. Then a second message box appears asking if you would like to save the current alignment to a MEGA file. If the
current alignment is saved to a MEGA file, a third message box will appear asking if you would like to open the
saved MEGA file in the main MEGA application.
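
For readers who want to check the result of the Reverse Complement command by hand, here is a minimal Python sketch of the operation. It assumes plain DNA bases (A, C, G, T in either case) and leaves any other character, such as a gap or N, unchanged; the function name is hypothetical and the handling of ambiguity codes in MEGA itself is not described in this section.

```python
# Translation table for the four DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```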

Search Menu (in Alignment Explorer)

This menu allows searching for sequence motifs and marked sites. The commands in this menu are:
Find Motif: This activates the Find Motif search box. When this box appears, it asks you to enter a motif sequence (a
small subsequence of a larger sequence) as the search term. After you enter the search term, the Alignment
Explorer finds each occurrence of it and indicates it with yellow highlighting. For example, if you enter the motif
“AGA” as the search term, then each occurrence of “AGA” across all sequences in the sequence grid would be
highlighted in yellow.
Find Next: This searches for the first occurrence of the motif search term towards the end of the current sequence. If
no motif search has been performed prior to clicking this button, the Find Motif search box will appear.
Find Previous: This searches towards the beginning of the current sequence for the first occurrence of the motif search
term. If no motif search has been performed prior to clicking this button, the Find Motif search box will appear.
Find Marked Site: This locates the marked site in the current sequence. If no site has been marked for this sequence,
a warning box will appear.
Highlight Motif: If this item is checked, then all occurrences of the text search term (motif) are highlighted in the
alignment grid.
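
The kind of search that Find Motif performs can be approximated outside MEGA with a short Python sketch. The assumptions here are the sketch's own and are not taken from MEGA: matching is case-insensitive, overlapping occurrences are reported, gaps are not skipped, and the function name find_motif is hypothetical.

```python
def find_motif(sequences, motif):
    """Return (name, 0-based start) for every occurrence of motif.

    sequences is a dict mapping sequence names to sequence strings.
    """
    motif = motif.upper()
    hits = []
    for name, seq in sequences.items():
        text = seq.upper()
        start = text.find(motif)
        while start != -1:
            hits.append((name, start))
            # Step one position forward so overlapping hits are found too.
            start = text.find(motif, start + 1)
    return hits


print(find_motif({"seq1": "AGAGA", "seq2": "ccaga"}, "AGA"))
```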

Sequencer Menu (in Alignment Explorer)

Edit Sequencer File: This item displays the Open File dialog box used to open a sequencer data file. Once opened,
the sequencer data file is displayed in the Trace Data File Viewer/Editor. This editor allows you to view and edit trace
data produced by the automated DNA sequencer. It reads and edits data in ABI and Staden file formats and the
sequences displayed can be added directly into the Alignment Explorer or sent to the Web Browser for
conducting BLAST searches.

Web Menu (in Alignment Explorer)


This menu provides access to commands for querying GenBank and doing a BLAST search, as well as access to the
MEGA Web Browser. The commands in this menu are:
Query Gene Banks: This item starts the Web Browser and accesses the NCBI home page
(http://www.ncbi.nlm.nih.gov).
Do BLAST Search: This item starts the Web Browser and accesses the NCBI BLAST query page. If you select a
sequence in the alignment grid prior to selecting this item, the web browser will automatically copy the selected
sequence data into the search field.
Show Browser: This item will show the Web Browser.

Appendix A: Frequently Asked Questions

Computing statistics on only highlighted sites in Data Explorer

Go to the Statistics menu in the Sequence Data Explorer, and click on Use highlighted sites only. Now all
statistical quantities computed using the Statistics menu will be based only on the highlighted sites.

Finding the number of sites in pairwise comparisons

If you want to find the number of sites between pairs of sequences or the average number of sites, then go to
the Distance menu and select the desired distance type. Then in Substitutions to Include, select an option
regarding the number of sites.

Get more information about the codon based Z-test for selection

The codon based Z-test for selection can be done in two places. First, you can use the Tests | Codon Based tests of
selection | Z-test (large sample) option to find the probability that the null hypothesis will be rejected, in addition to the
actual value of the Z-statistic. Alternatively, if you want to know the difference between s and n
(synonymous and nonsynonymous substitutions) and their variance, you can go to the Distances | Pairwise menu
option and in the distance computation dialog, select an appropriate method (e.g., Nei-Gojobori method) and then
choose s-n (or n-s depending on your need) from the Substitutions to include menu. Also, you can choose to compute
standard error.

Menus in MEGA are so short; where are all the options?

Our aim in developing the objectively driven user-interface of MEGA has been a clutter-free work environment that
asks the user for information on a need-to-know basis. Although this modular analytical tool looks simple, behind each
menu item is a wide range of useful options and tools that come with enhancements that are designed to reduce the
amount of time needed for mundane non-technical tasks. Consider, for example, the Sequence Data Explorer. This
unique module is hidden away when you don't want it but is always working behind the scenes. It allows you to view
the data in various ways, export data subsets, and compute many important basic statistical quantities. Another
interesting module is the Genetic Code selector, which allows you to choose the depth at which you wish to work
with a code table. With it you can select a desired code table, add new data to and edit the existing code table, view
the selected code table in a conventional format, compute the degeneracy for each site in every codon, and compute
the number of potentially synonymous and nonsynonymous sites for each codon. In addition, you can always find
help by checking the help index.

Writing only 4-fold degenerate sites to an output file

All sequence data subset facilities are accessible through the Export Data command in the Sequence Data
Explorer. To write 4-fold degenerate sites to a file, highlight the 4-fold degenerate sites on the screen and then
select Export Data. In that command, choose to write only the highlighted sites. For example, if you select to write
only the third codon positions, all 4-fold degenerate sites found in the third codon positions will be written to the
file.
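
To make the notion of a 4-fold degenerate site concrete, the Python sketch below lists the third codon positions at which any of the four bases encodes the same amino acid. It is an illustration only, under these assumptions: the standard genetic code is used, the sequence is in frame starting at its first base, and gaps or ambiguous bases are not treated specially. The function names are hypothetical, and this is not how MEGA itself identifies the sites.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def fourfold_prefixes(code=CODE):
    """Two-base prefixes whose four codons all encode one amino acid."""
    return {a + b for a in BASES for b in BASES
            if len({code[a + b + c] for c in BASES}) == 1}


def fourfold_third_positions(seq):
    """0-based indices of third positions in 4-fold degenerate codons."""
    seq = seq.upper()
    prefixes = fourfold_prefixes()
    return [i + 2 for i in range(0, len(seq) - 2, 3)
            if seq[i:i + 2] in prefixes]


# Codons: ATG, GCT, TTT, GGA. Only GCT and GGA are 4-fold degenerate.
print(fourfold_third_positions("ATGGCTTTTGGA"))
```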

Why can't I display branch lengths for a bootstrap consensus tree?

MEGA does not compute or provide branch lengths for bootstrap consensus trees, as they generally contain
multi-furcations due to the partition frequency cutoffs. Estimates of branch lengths in these cases are not
correct, as the collapsed branches have non-zero lengths in reality but are not statistically resolved (i.e.,
they lack significance by the bootstrap method).

Divergence Times in the Outgroup Cluster are not Visible

When calibration constraints are used in the Reltime analysis, divergence times are not displayed in the
outgroup clade because the Reltime method uses evolutionary rates from the ingroup to calculate divergence
times. The method does not assume that evolutionary rates in the ingroup clade apply to the outgroup.

Appendix B: Main Menu Items and Dialogs Reference

Main MEGA Window

The main window in MEGA contains a menu bar, a main toolbar (just beneath the menu bar), a secondary toolbar near
the bottom of the window, and a bottom status bar.
Menu Bar

Menu      Description
File      Use the File menu commands to open data for analysis, edit text files, convert file formats, and exit
          MEGA.
Analysis  Use the Analysis menu to launch the analyses available in MEGA.
Help      Use the Help menu to access the online help system, which is displayed in a special help window.

Main Toolbar
This toolbar contains logically organized menus for launching the analyses available in MEGA as well as for
importing/exporting data.

Align Edit and build sequence alignments, view/edit sequencer files, query online data banks, do BLAST search,
and launch the MEGA Web Browser.
Data Open data and session files, explore active data, export active data, save active data to a session file, select
genetic code table, select/edit genes and domains, select/edit taxa and groups.
Models Launch analyses related to substitution models, such as best-fit model selection, pattern heterogeneity
tests, estimation of substitution matrix and transition/transversion bias, calculate codon usage bias and
composition statistics.
Distance Compute evolutionary distances: pairwise, overall mean, within group mean, between group mean, and
net between group mean.
Diversity Compute mean diversity: within sub-populations, in entire population, between populations. Also,
compute coefficient of differentiation.

Phylogeny Construct/test phylogenies using Maximum Likelihood, Maximum Parsimony, Neighbor-Joining,
Minimum Evolution, and UPGMA. Also, open saved tree sessions.
User Tree Analyze a given tree using Maximum Likelihood, Maximum Parsimony, or Ordinary Least Squares.
Display Newick trees, or edit/draw trees manually.
Ancestors Infer ancestral states using Maximum Likelihood or Maximum Parsimony.
Selection Estimate selection for each codon using HyPhy, perform codon-based Z-test of selection, codon-based
Fisher’s exact test of selection, or Tajima’s test of neutrality.
Rates Using Maximum Likelihood, estimate gamma shape parameter for site rates or estimate position-by-
position rates.
Clocks Perform Tajima’s relative rate test, test for molecular clock, or compute a time tree using the Reltime
Maximum Likelihood method.
Diagnose Explore the functional impact of non-synonymous single nucleotide variants (nSNVs).

Secondary Toolbar
This toolbar contains items that are not suitable for the main toolbar.

First time user? A very brief description of using MEGA.


Tutorial A collection of tutorials for learning how to perform common tasks when using MEGA.
Examples A collection of example data files that are used in the tutorials.
Citation How to cite MEGA.
Report a Bug Help improve MEGA by filing a bug report with the authors.
Updates? Check to see if there is a newer version of MEGA available.
MEGA Links Links to resources related to MEGA.
Toolbar Customize the main toolbar in MEGA.
Preferences Customize user preferences.

Alignment Menu

Align Menu
This menu provides access to options for viewing and building DNA and protein sequence alignments and for
exploring the web based databases (e.g., NCBI Query and BLAST searches) in the MEGA environment.

Query Databanks

Align | Query Databanks


Use this to open the MEGA web-browser to search the NCBI and other web sites for sequence data.

Show Web Browser

Align | Show Web Browser


Use this option to launch the MEGA Web Browser.

View/Edit Sequencer Files

Align | Edit/View Sequencer Files (Trace)…


Use this option to view/edit the sequence data in ABI (.abi and .ab1) and Staden (.scf) files. The Alignment
Explorer provides this option directly.

Data Menu

This allows you to explore the active data set, and establish various data attributes, and data subset options. It also
allows you to perform various important tasks, including activating a data file, editing text files, and exiting MEGA.

Open A File

Data | Open A File/Session…

When you choose this option you will be prompted to select a file to load into MEGA. You may click Cancel if you
don’t wish to load a file yet. Once you have selected the file, MEGA will determine the type of file you have selected
by its extension (e.g., .nwk, .meg, .msdx).
If there is any issue with the file, such as improperly formatted or corrupt data, MEGA will alert you to
the issue.

Reopen Data

File | Reopen Data


This reopens a recently closed data file from the submenu, which shows the names of the five most recently used
data files.

Save Data Session To File

Data | Save Data Session To File

This saves all the information about the data you are currently working on (not results of calculations though) so it
may later be resumed. Read further about session saving.

Export Data

Data | Export Data


This command activates the appropriate input data explorer, presents a dialog box for specifying options and a file
for writing the currently active data subset in a chosen format.

Close Data

This deactivates the currently open data file. Before issuing this command, save any modifications that you wish to
retain by using Session Saving (Data | Save Session).
This command is enabled only if a dataset is loaded in MEGA.

Data Explorer

Data | Data Explorer


Data Explorers are used to view the currently active data set, calculate its basic statistical attributes, export it in formats
compatible with other programs, and define subsets for analysis. Depending on the currently active data type, one of
the following explorers will be available:

Data Type                     Explorer
DNA, RNA, Protein sequences   Sequence Data Explorer
Evolutionary divergence       Distance Data Explorer

Setup/Select Genes & Domains

Data | Select Genes & Domains


The Setup/Select Genes & Domains dialog box allows you to view, specify, and edit genes and domains and to label
sites.

Setup/Select Taxa & Groups

Data | Setup/Select Taxa & Groups

This invokes the Setup/Select Taxa & Groups dialog box for including or excluding taxa, defining groups of taxa, and
editing names of taxa and groups.

Printer Setup

Data | Printer Setup


Choose this command to change the properties of your printer.

Exit

Data | Exit
This command closes the currently active data file and all other windows. If you want to save changes to the data set
displayed on the screen, before issuing this command you must choose File | Export Data and Print or Save. Note
that MEGA does not automatically save changes made to active data to the original data file.

Distances Menu

Distances Menu
Use this menu to compute: pairwise and average distances between sequences; within, between, and net average
distances among groups; and sequence diversity statistics for data from multiple populations.

Compute Pairwise

Distances | Compute Pairwise…


Choose this to compute the distances and standard errors between pairs of taxa. A Select Distance Options dialog, in
which you can choose the desired distance estimation method and other relevant options, will appear.

Compute Overall Mean

Distances | Compute Overall Mean…


This calculates the mean pairwise distance and standard error for the set of sequences under study. The overall mean is
the arithmetic mean of all individual pairwise distances between taxa. A Select Distance Options dialog, in which you
can choose the desired distance estimation method and other relevant options, will appear. Before using the bootstrap
method to compute standard error, please read how MEGA implements the bootstrap method for this purpose.

Compute Within Groups Mean

Distances | Compute Within Groups Means…


This computes the mean pairwise distances within groups of taxa. The within group means are arithmetic means of all
individual pairwise distances between taxa within a group. A Select Distance Options dialog, in which you can
choose the desired distance estimation method and other relevant options, appears. You must have at least one group
of taxa, with a minimum of two taxa defined, to utilize this option.

Compute Between Groups Means

Distances | Compute Between Groups Means…


This computes the average distances between groups of taxa. The average distance is the arithmetic mean of all
pairwise distances between two groups in the inter-group comparisons. A Select Distance Options dialog, in which
you can choose the desired distance estimation method and other relevant options, will appear. You must have at least
two groups of taxa for this option to work.

Compute Net Between Groups Means

Distances | Compute Net Between Groups Means…


This command computes the net average distances between groups of taxa. The net average distance between two
groups is given by
dA = dXY – ((dX + dY)/2)
where dXY is the average distance between groups X and Y, and dX and dY are the mean within-group
distances. A Select Distance Options dialog, in which you can choose the desired distance estimation method and
other relevant options, will appear.
You must have at least two groups of taxa with a minimum of two taxa each for this option to work.
How to define groups of taxa.
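
The within-group, between-group, and net between-group means described above are simple arithmetic means over pairwise distances, and can be reproduced from any pairwise distance table. The Python sketch below shows the arithmetic only. Its data layout (a dict keyed by frozenset pairs of taxon names) and its function names are assumptions made for the example, not MEGA structures.

```python
from itertools import combinations


def pair_distance(d, a, b):
    """Look up the distance between taxa a and b (order does not matter)."""
    return d[frozenset((a, b))]


def within_mean(d, group):
    """Mean of all pairwise distances within one group (needs >= 2 taxa)."""
    pairs = list(combinations(group, 2))
    return sum(pair_distance(d, a, b) for a, b in pairs) / len(pairs)


def between_mean(d, gx, gy):
    """Mean of all pairwise distances between taxa of two groups."""
    total = sum(pair_distance(d, a, b) for a in gx for b in gy)
    return total / (len(gx) * len(gy))


def net_between_mean(d, gx, gy):
    """dA = dXY - ((dX + dY) / 2)."""
    return between_mean(d, gx, gy) - (within_mean(d, gx) + within_mean(d, gy)) / 2


# Made-up distances for two groups of two taxa each.
d = {frozenset(("a", "b")): 0.2, frozenset(("c", "d")): 0.4}
for x in "ab":
    for y in "cd":
        d[frozenset((x, y))] = 1.0
print(round(net_between_mean(d, ["a", "b"], ["c", "d"]), 6))
```

With these made-up values dXY is 1.0, dX is 0.2 and dY is 0.4, so the net distance is 1.0 - 0.3 = 0.7.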

Diversity Menu

Compute Sequence Diversity

Distances | Compute Sequence Diversity


The Sequence Diversity submenu provides four commands for computing the population and subpopulation diversities
that are useful in molecular population genetics studies. First, you define a group, using a population of
sequences. Unlike the generic averages of within group, between group, and net between group distances calculated
using other commands in the Distances menu, formulas used in the following commands are those used specifically in
population genetics analyses.
The commands are:
Mean Diversity within Subpopulations
In a subpopulation, the mean diversity is defined as

where xi is the frequency of the i-th sequence in the sample from subpopulation i, and q is the number of different
sequences in this subpopulation.
Mean Diversity for Entire Population
For the entire population, the mean diversity is defined as

where xi is the estimate of average frequency of the i-th allele in the entire population, and q is the number of
different sequences in the entire sample.
Mean Interpopulational Diversity
The estimate of inter-populational diversity is given by
deltaST = RT - RS
Coefficient of Differentiation
The estimate of the proportion of interpopulational diversity is given by
NST = deltaST/RT
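
The last two quantities follow directly from the two mean diversities. A small worked sketch in Python, reading RT as the mean diversity for the entire population and RS as the mean diversity within subpopulations (an interpretation of the subscripts, since the defining formulas are not reproduced in this text); the function name and the input values are hypothetical:

```python
def differentiation(r_total, r_sub):
    """Return (deltaST, NST) from the two mean diversities.

    deltaST = RT - RS and NST = deltaST / RT, as given in the text.
    """
    if r_total <= 0:
        raise ValueError("total diversity must be positive")
    delta_st = r_total - r_sub
    return delta_st, delta_st / r_total


# Made-up example: RT = 0.05 and RS = 0.04.
delta_st, n_st = differentiation(0.05, 0.04)
print(round(delta_st, 6), round(n_st, 6))
```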

Models Menu

Find Best DNA/Protein Models (ML)

Models | Find Best DNA/Protein Models (ML)

This option tests a data file (nucleotide or amino acid) for goodness of fit to some popular models of evolution, and
returns the values of several criteria which can be used to pick the most appropriate evolutionary model for your
analysis. The results also show the estimated values of all parameters for each model
(frequencies, transition probabilities, rate variation parameters, etc.), plus the count of total parameters. In most cases
you would pick a model that has a low number of parameters (to keep variance low) yet is accurate enough (as
measured by the goodness-of-fit criteria) for your needs.
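
The trade-off between fit and number of parameters can be illustrated with two widely used information criteria, AIC and BIC. This section does not say which criteria MEGA reports, so the Python sketch below is a generic illustration rather than a description of MEGA's output. The function names, the use of the number of sites as the sample size, and the example values are all assumptions.

```python
import math


def aic(ln_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * ln_likelihood


def bic(ln_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * ln_likelihood


def rank_by_bic(fits, n_sites):
    """Sort model names by BIC; fits maps name -> (lnL, number of parameters)."""
    return sorted(fits, key=lambda m: bic(fits[m][0], fits[m][1], n_sites))


# Made-up fits: the richer model fits slightly better but has more parameters.
fits = {"simple": (-1000.0, 1), "rich": (-999.5, 10)}
print(rank_by_bic(fits, n_sites=500))
```

In this made-up case the small gain in likelihood does not justify nine extra parameters, so the simpler model ranks first.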

Disparity Index Test


Tests the null hypothesis that sequences have evolved with the same pattern of substitution, as judged from the extent
of differences in base composition biases between sequences (Kumar and Gadagkar 2001).

Estimate Substitution Matrix (ML)

Models | Estimate Substitution Matrix

This option estimates and displays the nucleotide substitution rate matrix using the Maximum Likelihood method for
the current data set and evolutionary model selected. This method finds the set of values for the substitution rate
matrix parameters that maximizes the probability (likelihood) of the data. This is applicable only to nucleotide data
(coding or non-coding).

Estimate Transition/Transversion Bias (ML)

Models | Estimate Transition/Transversion Bias (ML)

This option estimates the Transition/Transversion bias parameters κ, κ1, and κ2 using the Maximum Likelihood
method. κ is used as a parameter of the Kimura Two-Parameter model of nucleotide evolution and some others, while
κ1 and κ2 are used by the Tamura-Nei 93 model. This is applicable only to nucleotide data (coding or non-coding).
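
For intuition about what these parameters describe, the Python sketch below classifies the observed differences between two aligned sequences as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions. This is a simple count, not the Maximum Likelihood estimation that MEGA performs; the function name is hypothetical, and sites with a gap or a non-ACGT character in either sequence are skipped by assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Count observed transitions and transversions between aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap, or ambiguous base
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# A<->G and C<->T are transitions; T<->A is a transversion; the gap is skipped.
print(count_ts_tv("AACCGT", "GATC-A"))
```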

Compute MCL Substitution Matrix

Models | Compute MCL Substitution Matrix

This option estimates and displays the substitution rate matrix for the Maximum Composite Likelihood (MCL) method
for the current data set (nucleotide data only, coding or non-coding).

Compute MCL Transversion/Transition bias

Models | Compute MCL Transversion/Transition bias

This option estimates the Transition/Transversion bias parameters κ (for purines + pyrimidines), κ1 (purines only),
and κ2 (pyrimidines only) under the Maximum Composite Likelihood model (nucleotide data only, coding or
non-coding).

Compute Pattern Disparity Index

Disparity Index Test


Tests the null hypothesis that sequences have evolved with the same pattern of substitution, as judged from the extent
of differences in base composition biases between sequences (Kumar and Gadagkar 2001).

Compute Composition Distance


Composition distance is a measure of the difference in nucleotide (or amino acid) composition for a given pair of
sequences. It is one half the sum of the squared differences in the counts of each base (or residue) between the two
sequences. MEGA computes and presents the Composition Distance per site, which is given by the total composition
distance between two sequences divided by the number of positions compared, excluding gaps and missing data.
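
The definition above can be sketched in a few lines of Python. This is only an illustration of the formula as described in the text, not MEGA's own code; the function name and the choice of "-" and "?" as the gap and missing-data symbols are assumptions made for the example.

```python
from collections import Counter

# Assumed symbols for alignment gaps and missing data (illustrative only).
GAP_OR_MISSING = set("-?")


def composition_distance_per_site(seq1, seq2):
    """Half the sum of squared count differences, divided by the sites compared."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only positions where neither sequence has a gap or missing data.
    kept = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a not in GAP_OR_MISSING and b not in GAP_OR_MISSING
    ]
    if not kept:
        raise ValueError("no comparable positions")
    counts1 = Counter(a for a, _ in kept)
    counts2 = Counter(b for _, b in kept)
    states = set(counts1) | set(counts2)
    total = 0.5 * sum((counts1[s] - counts2[s]) ** 2 for s in states)
    return total / len(kept)


print(composition_distance_per_site("AAAACCGT", "AACCCCGT"))
```

In this example the two sequences differ by two in their counts of A and of C, so the total distance is (4 + 4) / 2 = 4 over 8 compared sites.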

Compute Amino Acid Composition

Statistics | Amino acid Composition


This command is visible only if the data consists of amino acid sequences or if the translated protein coding
nucleotide sequences are displayed. MEGA will compute the amino acid frequencies for each sequence as
well as an overall average, which will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).

Compute Nucleotide Composition

Statistics | Nucleotide Composition


This command is visible only if the data consist of nucleotide sequences. MEGA computes the base
frequencies for each sequence as well as an overall average. These will be displayed by domain in a Text
Editor window (if domains have been defined in Setup/Select Genes & Domains).

Phylogeny Menu

Use the Phylogeny menu to construct phylogenetic trees, infer their reliability using the bootstrap and interior branch
tests, and view previously constructed trees.

Bootstrap Test of Phylogeny

Phylogeny | Construct/Test Neighbor-Joining Tree


Or
Phylogeny | Construct/Test Minimum-Evolution Tree
Or
Phylogeny | Construct/Test UPGMA Tree
Or
Phylogeny | Construct/Test Maximum Likelihood Tree
Or
Phylogeny | Construct/Test Maximum Parsimony Tree(s)

One of the most commonly used tests of the reliability of an inferred tree is Felsenstein's (1985) bootstrap test, which
is evaluated using Efron's (1982) bootstrap resampling technique. If there are m sequences, each with n nucleotides
(or codons or amino acids), a phylogenetic tree can be reconstructed using some tree building method. From each
sequence, n nucleotides are randomly chosen with replacement (the same sites are drawn for every sequence), giving
rise to m rows of n columns each. These now constitute a new set of sequences. A tree is then reconstructed with these
new sequences using the same tree building method as before. Next, the topology of this tree is compared to that of
the original tree. Each interior branch of the original tree that differs from the bootstrap tree in the sequences it
partitions is given a score of 0; all other interior branches are given the value 1. This procedure of resampling the
sites and the subsequent tree reconstruction is repeated several hundred times, and the percentage of times each
interior branch is given a value of 1 is noted. This is known as the bootstrap value. As a general rule, if the bootstrap
value for a given interior branch is 95% or higher, then the topology at that branch is considered "correct". See Nei
and Kumar (2000) (chapter 9) for further details.
This test is available for five different methods: Neighbor Joining, Minimum Evolution, Maximum
Parsimony, UPGMA, and Maximum Likelihood.
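
The resampling and scoring procedure described above can be sketched as follows. This is a minimal illustration, not MEGA's implementation. The tree building step is deliberately left out: build_splits is a hypothetical function that you would supply, taking an alignment and returning the set of partitions (interior branches) of the tree it infers.

```python
import random


def resample_sites(alignment, rng):
    """Draw n columns with replacement; the same columns are used for every sequence."""
    n = len(alignment[0])
    columns = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, original_splits, build_splits, replicates=500, seed=0):
    """Percentage of replicate trees that contain each interior branch of the original tree.

    build_splits is a hypothetical, user-supplied callable that infers a tree
    from an alignment and returns its interior branches as a set of hashable
    partitions (for example, frozensets of taxon indices).
    """
    rng = random.Random(seed)
    hits = {split: 0 for split in original_splits}
    for _ in range(replicates):
        replicate_splits = build_splits(resample_sites(alignment, rng))
        for split in original_splits:
            if split in replicate_splits:  # score 1; otherwise the branch scores 0
                hits[split] += 1
    return {split: 100.0 * count / replicates for split, count in hits.items()}
```

Each branch of the original tree is scored 1 when the replicate tree contains the same partition and 0 otherwise, so the returned percentage corresponds to the bootstrap value described in the text.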

Interior Branch Test of Phylogeny

Phylogeny | Construct/Test Neighbor-Joining Tree


Or
Phylogeny | Construct/Test Minimum-Evolution Tree

A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its
standard error and is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence
probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that
branch is considered significantly positive. Select test of phylogeny for either of these trees in the Analysis
Preferences dialog.

See Nei and Kumar (2000) (chapter 9) for further details.

Display Saved Tree Session

Phylogeny/User Tree | Open Tree Session


Use this command to display a previously saved Tree Explorer session (saved in a filename with .MTS extension).

User Tree Menu

Analyze User Tree by Maximum Likelihood

User Tree Computation | Analyze User Tree by Maximum Likelihood

This option estimates the branch lengths by the Maximum Likelihood (ML) method for a user-supplied phylogenetic
tree for the currently open sequence data set. The Log Likelihood for the tree is also shown.

Analyze User Tree by Least Squares

User Tree Computation | Analyze User Tree by Least Squares

This option estimates the branch lengths by the Ordinary Least Squares (OLS) method for a user-supplied
phylogenetic tree for the currently open sequence data set. The sum of branch lengths for the entire tree is also
shown.

Analyze User Tree by Parsimony

User Tree Computation | Analyze User Tree by Parsimony

This option tests the tree that you provide and reports how well it fits the data file you have open. Under this
method, the best tree is the one that requires the least evolutionary change.

Display Newick Trees from File

User Tree | Display Newick Tree


Use this to retrieve and display one or more trees written in Newick format. Multiple trees can be displayed, and their
consensus built, in the Tree Explorer. MEGA supports the display of Newick format trees containing branch lengths
as well as bootstrap or other counts (note that the Newick formats do not contain the total number of bootstrap
replications conducted).

Tree Topology Editor

The tree topology editor shows a single tree's topology in a way that allows you to modify it. The toolset is very
similar to that of the Tree Explorer, but with some features added and others removed.
The Topology Editor is also used during analyses in which the user supplies their own tree. In some cases
the taxa names in the supplied tree don’t match up exactly with the names in the sequence file being used. In these
cases users will have a chance to fix the inconsistency by mapping the sequence names to the tree names. How
name mapping works is described further below.
Editing the Topology of a Tree
The most basic use of the Topology Editor is to enable the user to build or edit a tree file. The editor can be launched
from the main form by clicking User Tree->Edit/Draw Tree (Manually). If you don’t have a tree file to start off with
then you can choose to either start from scratch or start with a randomly created tree based on your sequence file (this
just saves the time of adding the taxa).

This image is showing the Tree Topology Editor with the NJ tree for the Crab_rRNA.meg example file loaded.
Toolbar Explained

- Open a Newick tree file

- Open a recently edited tree. This is especially useful in the case where you have a tree file which doesn’t
completely match a sequence file. If you have to resolve the differences MEGA remembers the tree and the mapping
of the taxa. Just select the recently edited tree next time you need to use it with that sequence file.

- Generate a new tree.

- Save to Newick file format

- Change the taxa font

- Search for a taxon by name

- Copy an image of the tree to the clipboard (you can paste a picture of the tree into Word, or another program)

- Undo the last change (only applies to topology changes, not taxa name changes)

- Add a new taxon (adds on the currently selected branch, or if no branch is selected it adds to the very top
subtree)

- Delete a taxon (If a taxon is selected it will be drawn in blue, and this option will become enabled. To select a
taxon simply click its name once.)

- Place root at selected branch or node

- Swap branches of a subtree

- Resize tree to fit the window. This is especially useful with large trees.

- Resize the tree by dragging (select this option, then click and drag on the tree to resize it) Some larger trees
will take longer to resize simply because of their size.

Resolving taxa name discrepancies between a tree and a sequence file


Some analyses in MEGA will require you to specify a tree file. The tree should be a tree representative of the
sequence data, or at least contain the sequence names as taxa names in it. If the tree taxa names and the sequence data
names don’t match perfectly or the tree is the wrong size then you are given a chance to resolve these
discrepancies. The following is an example of creating an ML tree with an initial tree which doesn’t match the
sequence names.
We first specify a tree file in the analysis preferences

Next we are told that there is a Taxa Name Mismatch and we are asked how we would like to resolve this. One option
is Automatic Tree, which if chosen would simply have MEGA construct a neighbor-joining tree to use as a starting
point for the heuristic search. Instead, select Use Topology Editor so that MEGA will display the Tree Topology
Editor.

This is the dialog where we will map the taxa names from our sequence data file onto the tree. It’s important to
remember that the tree must have the same number of taxa as your sequence data. We will call the sequence data
names Active Data Names. In this example there is one extra taxon, which will need to be removed at the end.

At this point 4 of the 13 taxa have been mapped. Notice that on the left hand side when a taxon has been mapped it
has an entry associated with it under the Map to User Tree Name column. Mapped taxa also show up in the tree with
black text, and no longer say .
There are two ways to map an active data name to a tree name. The first is simply dragging the active data name from
the left hand side (by clicking and dragging) and dropping it on the tree over the tree name you would like to map it to.
The second way to map taxa is to click on the space in the Map to User Tree Name column that is across from the
Active Data Name you wish to map. This will bring up a selection box where you can click the tree name you want to
associate with it. Below is a screen shot of the second method.

We will now map the rest of the taxa.

Notice that one taxon on the tree isn’t mapped, but there are no Active Data Names left to map to it. This is an extra
taxon in the tree, and we will delete it. Simply right click on the name and select Remove OTU.
Our mapping process is now complete! Click the “OK” button. If you are going to be using this tree file and data set
for a number of analyses, you may want to note the “Recent Trees” feature, which keeps track of these
associations. Next time you can simply find the tree under “Recent Trees” and be done.
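
Before starting an analysis, it can be convenient to check outside of MEGA whether a tree file and a sequence file use the same names. The following Python sketch is one way to do this; it is not part of MEGA. Both function names are made up for this example, and the leaf-name extraction assumes a simple Newick string with unquoted names.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a simple Newick string (unquoted names assumed)."""
    # A leaf name follows '(' or ',' and runs until ':', ',', ')', '(' or ';'.
    # Labels that follow ')' (internal node labels, bootstrap counts) are skipped.
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in found if name.strip()]


def compare_names(tree_names, data_names):
    """Report names present in only one of the two sources."""
    tree, data = set(tree_names), set(data_names)
    return {
        "only_in_tree": sorted(tree - data),
        "only_in_data": sorted(data - tree),
    }


tree = "((human:0.1,chimp:0.1)95:0.05,gorilla:0.2,extra:0.3);"
report = compare_names(newick_leaf_names(tree), ["human", "chimpanzee", "gorilla"])
print(report)
```

Names listed under only_in_data would need to be mapped in the Topology Editor, and a name left over under only_in_tree after mapping corresponds to an extra taxon that would have to be removed.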

Display Saved Tree Session

Phylogeny/User Tree | Open Tree Session


Use this command to display a previously saved Tree Explorer session (saved in a filename with .MTS extension).

Ancestors Menu

Inferring Ancestral Sequences (ML)

Ancestral Sequences | Infer Ancestral Sequences (ML)


This option uses the Maximum Likelihood method to estimate the ancestral state of each node in a phylogenetic
tree. The state is chosen to be the one that maximizes the probability of the given sequence data under the selected
model of nucleotide or amino acid evolution. Inferring ancestral sequences using ML, on average, gives more
accurate results than using Maximum Parsimony, especially when the phylogenetic tree includes long branches.

Inferring Ancestral Sequences (MP)

Ancestral Sequences | Infer Ancestral Sequences (Parsimony)


When the sequence diversity is low, Maximum Parsimony is effective at inferring the ancestral sequences. If
your sequences are somewhat distant, MP may produce several possible sequences, and finding the most probable
one can sometimes be difficult.
Selection Menu

Estimate Selection for each Codon (HyPhy)

Selection | Estimate Selection for each Codon (HyPhy)

This option estimates the strength of selection (positive or negative) operating upon each individual codon in an
alignment and provides statistical support measures of each estimate. This requires coding DNA sequence data.
For this calculation, MEGA uses a third party program called HyPhy. This is mostly transparent to you (the
user). While the process is running, the progress dialog will have a second tab labeled “Command Line Output”, which
contains the direct output from HyPhy as if you had run it yourself. The first line in the Command Line Output tab
contains the actual command which was run for this analysis.

Codon Based Fisher's Exact Test

Distance | Codon Based Fisher’s Exact Test


This provides a test of selection based on the comparison of the numbers of synonymous and nonsynonymous
substitutions between sequences. Use this command to conduct a small sample test of positive selection (Zhang et
al. 1997): a one-tailed Fisher’s Exact test. If the resulting P-value is less than 0.05,
then the null hypothesis of neutral evolution (strictly neutral and purifying selection) is rejected. If the observed
number of synonymous differences per synonymous site (pS) exceeds the number of nonsynonymous differences per
nonsynonymous site (pN), then MEGA sets P = 1 to indicate purifying selection, rather than positive selection.

See Nei and Kumar (2000) (page 56) for further description and an example.

Codon Based Z-Test (large sample)

Distance | Codon Based Z-test (large sample)


One way to test whether positive selection is operating on a gene is to compare the relative abundance of synonymous
and nonsynonymous substitutions that have occurred in the gene sequences. For a pair of sequences, this is done by
first estimating the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous
substitutions per nonsynonymous site (dN), and their variances: Var(dS) and Var(dN), respectively. With this
information, we can test the null hypothesis that H0: dN = dS using a Z-test:
Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
The level of significance at which the null hypothesis is rejected depends on the alternative hypothesis (HA):
H0: dN = dS
HA: (a) dN ≠ dS (test of neutrality).
(b) dN > dS (positive selection).
(c) dN < dS (purifying selection).

For alternative hypotheses (b) and (c), we use a one-tailed test and for (a) we use a two-tailed test. These three tests
can be conducted directly for pairs of sequences, over all sequences, or within groups of sequences. For testing for
selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the analytical formulas or the
bootstrap resampling method.
For data sets containing more than two sequences, you can compute the average number of synonymous substitutions
and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar to the one mentioned
above. The variance of the difference between these two quantities is estimated by the bootstrap method (See Nei and
Kumar (2000) page 55).
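
As an illustration of the Z-test above, the following Python sketch computes Z from dN, dS and their variances and converts it to a P-value for each of the three alternative hypotheses. It assumes that Z is approximately standard normal under the null hypothesis; the function name and the keyword values are chosen for this example only and do not correspond to MEGA settings.

```python
import math


def codon_z_test(dN, dS, var_dN, var_dS, alternative="neutrality"):
    """Return (Z, P) for H0: dN = dS, assuming Z is standard normal under H0."""
    z = (dN - dS) / math.sqrt(var_dS + var_dN)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    if alternative == "positive":      # HA (b): dN > dS, one-tailed
        p = upper_tail
    elif alternative == "purifying":   # HA (c): dN < dS, one-tailed
        p = 1.0 - upper_tail
    elif alternative == "neutrality":  # HA (a): dN != dS, two-tailed
        p = 2.0 * min(upper_tail, 1.0 - upper_tail)
    else:
        raise ValueError("alternative must be 'neutrality', 'positive' or 'purifying'")
    return z, p


z, p = codon_z_test(0.3, 0.1, 0.005, 0.005, alternative="positive")
print(round(z, 3), round(p, 4))
```

The variances passed in may come from either the analytical formulas or bootstrap resampling; the sketch does not estimate them.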

Tajima's Test of Neutrality

Selection | Tajima’s Test of Neutrality
This conducts Tajima’s test of neutrality (Tajima 1989), which compares the number of segregating sites per site with
the nucleotide diversity. (A site is considered segregating if, in a comparison of m sequences, there are two or more
nucleotides at that site; nucleotide diversity is defined as the average number of nucleotide differences per site
between two sequences). If all the alleles are selectively neutral, then the product 4Nv (where N is the effective
population size and v is the mutation rate per site) can be estimated in two ways, and the difference between the two
estimates provides an indication of non-neutral evolution. Please see Nei and Kumar (2000) (pages 260-261) for further
description.
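
The two estimates of 4Nv mentioned above can be sketched in Python as follows. One estimate is the number of segregating sites per site divided by a1 = 1 + 1/2 + ... + 1/(m-1); the other is the nucleotide diversity. This sketch stops at the two estimates and does not compute the test statistic or its significance. The function name is hypothetical, and dropping every column that contains "-" or "?" is an assumption made for the example.

```python
from itertools import combinations


def tajima_estimates(seqs):
    """Return (segregating sites per site, estimate of 4Nv from it, nucleotide diversity)."""
    m = len(seqs)
    # Assumption: use only columns without gaps or missing data.
    columns = [col for col in zip(*seqs) if not set(col) & set("-?")]
    n_sites = len(columns)
    segregating = sum(1 for col in columns if len(set(col)) > 1)
    ps = segregating / n_sites
    a1 = sum(1.0 / i for i in range(1, m))
    theta_from_s = ps / a1
    pairs = list(combinations(range(m), 2))
    differences = sum(
        sum(1 for col in columns if col[i] != col[j]) for i, j in pairs
    )
    pi = differences / (len(pairs) * n_sites)  # average pairwise differences per site
    return ps, theta_from_s, pi


print(tajima_estimates(["AAAA", "AAAT", "AATT"]))
```

A large gap between the second and third returned values is the kind of difference that the test evaluates.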

Rates Menu

Estimate Position-by-Position Rates (ML)

Rates | Estimate Position-by-Position Rates (ML)

This option uses the Maximum Likelihood method to estimate the rate of evolution at each nucleotide or protein site
of an alignment. The rate of evolution at each site is chosen so as to maximize the probability of the given alignment
sequence data under the selected model of evolution.

Estimate Gamma Parameter for Site Rates (ML)


Estimate the value of the shape parameter for the discrete Gamma distribution.

Clock Menu

Tajima's Test (Relative Rate)

Molecular Clocks | Tajima’s Relative Rate Test


Use this to conduct Tajima’s relative rate test (Tajima 1993), which works in the following way. Consider three
sequences, 1, 2 and 3, and let 3 be the out-group. Let n_ijk be the observed number of sites in which sequences 1, 2 and
3 have nucleotides i, j and k, respectively. Under the molecular clock hypothesis, E(n_ijk) = E(n_jik) irrespective of
the substitution model and whether or not the substitution rate varies with the site. If this equality is rejected, then
the molecular clock hypothesis can be rejected for this set of sequences.
In response to this command, you can select the three sequences for conducting Tajima’s test. For nucleotide
sequences, this test offers the flexibility of using only transitions, only transversions, or both. If the data are protein
coding, then you can choose to analyze translated sequences or any combination of codon positions by clicking on the
‘Data for Analysis’ button.

See Nei and Kumar (2000) (pages 193-196) for further description and an example.
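
A common form of this test counts the sites at which only sequence 1 differs from the other two (m1) and the sites at which only sequence 2 differs (m2), and compares (m1 - m2)^2 / (m1 + m2) with a chi-squared distribution with one degree of freedom. The Python sketch below illustrates that form only. It uses all substitution types, skips sites containing "-" or "?", and is not MEGA's implementation; the function name is made up for the example.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Return (m1, m2, chi-squared, P) for the relative rate test with one degree of freedom."""
    m1 = m2 = 0
    for a, b, c in zip(seq1, seq2, outgroup):
        if "-" in (a, b, c) or "?" in (a, b, c):
            continue  # assumption: skip sites with gaps or missing data
        if a != b and b == c:
            m1 += 1   # change unique to sequence 1
        elif a != b and a == c:
            m2 += 1   # change unique to sequence 2
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-squared with 1 d.f.
    return m1, m2, chi2, p


print(tajima_relative_rate("AAAAACGT", "CCCAACGT", "CCAAACGT"))
```

Sites at which all three sequences differ, or at which sequences 1 and 2 agree, do not contribute to either count.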

Molecular Clock Test (ML)

Clocks | Test Molecular Clock (ML)

This option performs a Maximum Likelihood test of the molecular clock hypothesis for a given tree topology and
sequence alignment. (The “Molecular Clock Hypothesis” means that all tips of the tree are equidistant from the root
of the tree.) Two log-likelihood values are calculated and displayed, one with and one without the clock
hypothesis. The latter will always be larger (note that the numbers are negative, so “larger” means “smaller in
absolute value”). The statistical significance of the difference may be tested by comparing twice the difference in
log-likelihood values to a chi-squared threshold value with s-2 degrees of freedom, where s is the number of sequences
in the alignment.
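
The comparison described above can be sketched in Python as follows. The two log-likelihood values are taken as given (for example, read from MEGA's output); the function names are hypothetical, and the chi-squared tail probability is computed with closed-form sums that hold for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(df // 2))
    k = df // 2
    tail = sum(h ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, k + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * tail


def clock_test(lnL_clock, lnL_no_clock, n_sequences):
    """Return (statistic, degrees of freedom, P) for the molecular clock test."""
    statistic = 2.0 * (lnL_no_clock - lnL_clock)
    df = n_sequences - 2
    return statistic, df, chi2_sf(statistic, df)


# Made-up log-likelihood values for an alignment of 7 sequences.
print(clock_test(-1005.0, -1000.0, 7))
```

With these made-up values, twice the difference is 10 with 5 degrees of freedom, which falls just short of the 5% threshold of about 11.07.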

Constructing a Timetree (ML)

This example shows how to generate a timetree in MEGA. For this analysis, MEGA uses a Timetree
Wizard window which will walk you through the necessary steps. The data files used in this example can
be found in the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select Clocks | Compute Time Tree | RelTime-ML. The Timetree
Wizard window, which outlines the 6 steps for creating a timetree in MEGA will be displayed.
Step 1: First, we will load a sequence alignment file. In the Timetree Wizard window, click
the Browse... button and then using the file open dialog, find and select the “mtCDNA.meg” sequence
alignment file. After the alignment file is parsed by MEGA, the Load Tree File action in step 2 will
become enabled.
Step 2: Second, we will load the Newick tree file which gives the topology for our timetree. Click
the Browse… button and using the file open dialog that is displayed, find and select the “mtCDNA.nwk”
tree file. After this file is parsed and validated against the sequence alignment being used, step 3 will
become enabled.
Step 3: Next, we need to specify an outgroup taxon (we will specify one but multiple taxa can be in the
outgroup). Click the Select Taxa… button and the Taxa/Groups window will be displayed with all taxa in
our data listed in the Ungrouped Taxa list box (alternatively you can click the Select Branch… button and
use the Tree Explorer to specify the outgroup). Select the gibbon taxon and move it from
the Ungrouped Taxa list box to the Taxa in Outgroup list box by clicking the left-pointing arrow. Click
the Close button to save your changes and exit the Taxa/Groups dialog.
Step 4: Now, an option to specify divergence time calibration constraints will become available (if this
step is skipped, then only relative times of divergence will be calculated). Click the Add
Constraints… button. MEGA will display the Calibration Editor window that is used for specifying
divergence time constraints in the timetree.
First, we will create a divergence time calibration constraint by specifying two taxa whose most recent
common ancestor is the node for which the time constraint applies. In the Calibration Editor window,
select the Calibration | Calibrate MRCA menu item (or click the add new constraint button on the upper
left toolbar [it looks like a clock with a plus sign on the bottom right]). This will create a new calibration
constraint with a default name. From the Taxon A and Taxon B dropdown lists select chimpanzee and
bonobo. The Calibration Name edit box and the MRCA Node Label edit box are populated with default
names but you can edit these if you like. The MRCA node label is especially useful for interpreting the
tabular Timetree output produced by MEGA’s Timetree system so that you can quickly identify calibrated
nodes by name instead of by node number. In the Min Divergence Time edit box enter 1.2. In the Max
Divergence Time edit box enter 5.0.
Next, we will create another calibration constraint by selecting a node in the tree display. In the tree
display, select the node whose descendants are orangutan and sumatran (click this node to select it;
it will then have a red diamond around it). Select the Calibration | Calibrate Selected
Node menu item (or on the upper-right toolbar, click the new divergence time constraint button [it also
looks like a clock but has a plus sign on its lower-left instead of lower-right]). This will create a new
calibration. Now type 13.0 in the Max Divergence Time edit box. Leave the Min Divergence Time edit box
blank. Click the Finished button to complete step 4.
Step 5: Next, we can set several analysis settings such as substitution model, treatment of missing data,
etc… Back in the Timetree Wizard window, click the Set Analysis Options… button in order to open
the Analysis Preferences dialog. Click the Save button to use the default settings.

Step 6: Finally, in the Timetree Wizard window, click the Execute button. Progress will be displayed as
the analysis runs. When the analysis completes, the Tree Explorer window will return and display the time
tree.

Viewing the results


In the Tree Explorer window, the calculated timetree will be displayed with absolute times of divergence
for all branching points in the tree shown. Blue diamonds indicate those nodes which were used to
calibrate the tree. To display node height error bars, click View | Show/Hide | Node Height Error
Bars (if branch lengths are also shown, you can hide them by clicking View | Show/Hide | Branch
Lengths).
Select File | Export Current Tree (Timetree) and MEGA’s text editor will be displayed with a
description of the tree in tabular format.
Go back to the Tree Explorer window and select View | Show/Hide | Node Ids from the main menu. Now
the divergence times are no longer shown but node Ids are shown. These correspond to the Node Ids in
the tabular description of the timetree in the text editor.
The first column in the text editor has node labels. The one specified in the Calibration Editor is there and
there is another one which was contained in the mtCDNA.nwk file. Open the mtCDNA.nwk file using
the text editor and find this node label.
Select View | Show/Hide | Data Coverage and the data coverage for each internal node in the tree will be
displayed.

Diagnose Menu

Diagnose Mutations

Forecast the deleteriousness of nsSNVs using multiple methods and explore them in the context of the
variability permitted in the long-term evolution of the affected positions.

MEGA Dialogs
Input Data Format Dialog

The Input Data Format dialog is displayed if MEGA does not find enough information about the type of data included
in the input file.
Data Type
This displays the list of data types that MEGA is able to analyze. Highlight the current data type by clicking on
it. Depending on the type of data selected, you may need to provide information about the following additional items.

For Sequence Data


· Missing Data
Character used to show missing data in the data file; it should be set to a question mark (?).
· Alignment Gap
Character used to represent gaps inserted in the multiple sequence alignment; it is set to a dash (-) by default.
· Identical Symbol
Character used to represent identity with the first sequence in the data files; it is set to a dot (.) by default.
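
To illustrate how the Identical Symbol is interpreted, the following Python sketch expands each identity symbol into the character found at the same position in the first sequence. It is only an illustration of the convention described above, not MEGA's parser; the function name is made up, and gap and missing-data symbols are simply left untouched.

```python
def expand_identical(sequences, identical="."):
    """Replace the identical symbol with the corresponding character of the first sequence."""
    first = sequences[0]
    expanded = [first]
    for seq in sequences[1:]:
        expanded.append(
            "".join(ref if ch == identical else ch for ch, ref in zip(seq, first))
        )
    return expanded


print(expand_identical(["ACGT-A", "..C.-?", "T....."]))
```

Here the second sequence keeps its gap (-) and missing-data (?) characters, while every dot is replaced by the matching character from the first sequence.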

For Pairwise Distance Data


· Missing Data
Character used to show missing data in the data file; it should be set to a question mark (?).
· Matrix Format
Choose the lower-left or upper-right distance matrix for the pairwise distance data type.

Note: To avoid having to answer these questions every time you read your data file, save the data by exporting it
in MEGA format.

Setup/Select Genes & Domains Dialog

Use the Gene & Domain Editor to inspect, define, and select domains, genes, and labels for individual sites.

The Genes & Domains dialog consists of two tabs: Define/Edit/Select and Site Labels.
Define/Edit/Select tab
This tab contains a hierarchical listing of gene and domain names with the corresponding information organized into
four columns for amino acid sequences and six columns for nucleotide sequences.

Gene and domain name listing


Each line in this display contains a small 'expand/contract' box, a checkbox, a gene/domain icon, and the name of
the gene or domain. The 'expand/contract' box allows you to display or hide the information below a given gene. The
checkbox shows if the gene or domain is currently selected for analysis. All defined genes and domains appear below
the Genes\Domain node in the hierarchy. All domain names are shown with a yellow
background. The Independent node shows the number of Independent sites, which are not assigned to any domains or
genes.

If your input data file does not contain any domains, then MEGA automatically creates a domain called Data. If you
wish to create new domains, you should delete the Data domain to make all sites independent. Remember that
only independent sites can be assigned to domains, and sites cannot be assigned to multiple domains. Genes are
simply collections of domains, and thus gene boundaries are decided based on the domains contained in
them. The MEGA gene and domain organizer is flexible and is designed to enable you to specify genes
and domains as they appear in a genome. For instance, a sequence may contain one or more genes, each of which may
contain one or more domains. In between genes, there may be intergenic domains. In addition, within or between
genes or domains, there may be sites that are not members of any domain.

At the bottom of this tab, you will find a toolbar with many drop-down menu buttons, which can be used
to Add/Insert new genes or domains. The add and insert operations differ in the following way. If you add a gene or
domain, then the new gene or domain will be added at the end of the list to which the currently focused gene or
domain belongs. If you insert a gene (or domain), it will be inserted by shifting all the following genes
or domains down. Add and Insert commands are context sensitive.
You can rearrange the relative position of genes and domains by drag-and-drop operations.

Inspecting/modifying attributes of genes and domains


When you start, all genes and domains are shown. Click on the ‘+’ in the expand/contract box to expand the listing
for each gene to its domains. Click on the ‘-‘ to collapse to the gene. To select and deselect genes or domains from
analysis, click in the corresponding checkbox. When a gene is selected but some domains within the gene are not, the
checkbox for the gene will be grayed. If you deselect a gene, all domains within that gene are automatically deselected.

On the right side of the gene and domain hierarchy, you will find at least four columns of information for each domain
and gene. All information shown for genes is computed based on the domains contained.

The first two columns show the site number in the sequence where the domain begins (From column) and where it
ends (To column). The total number of sites shown next to the To column indicates the total number of sites
automatically computed, based on the range of information given in the previous two columns. A question mark (?)
shows that the domain exists but that the range of sites is not yet specified.

To specify or change sites that belong to a given domain, click on the domain name. The corresponding rows in
the From and the To columns contain a button with three dots (ellipses). To change the start site, click on the ellipses
in the From column. This will bring up a small Site Picker dialog box with which you can highlight the desired site and
click OK. In this viewer, you will see that sites have different background colors. A white background
marks independent sites, a red background indicates that the site is used by another domain, and a yellow background
shows that the current site belongs to the domain being edited. To cancel any changes, click on Cancel in the Site
Picker dialog box.

For nucleotide sequences, two additional columns are found in the Define/Edit/Select tab: the Coding? column and
the Codon Start column. A check-mark in the Coding? column shows that a given domain is protein coding. If it is
checked, then the next column allows you to specify whether the first site in the domain is in the first, second, or the
third codon position.

Site Labels Tab


This tab displays sequences and allows you to label individual sites. To do this, change the default underscore (_) in
the topmost line to the label of choice and give it a light green background. The site number will be displayed below
in a window, next to which is shown the name of the domain, along with the gene name. Labeled sites can be selected or
deselected for analysis.

To change or give a label to a site, click on the site and type in the character you wish to mark it with. You can use the
left and right arrow buttons on the keyboard to move to and then label adjacent sites. To change a label, simply
overtype it. To remove a label, use the spacebar to type a space.

Example
Imagine an alignment consisting of a genomic sequence, including a gene and its upstream and downstream regions.
You can define each intron and exon as a domain, and then define the overall gene, assigning the exons and introns to
that gene. The upstream and downstream regions also can be defined as domains, or possibly multiple domains,
depending on the analysis you wish to perform. These domains do not have to be assigned to any gene. Furthermore,
some sites may be left unassigned, as independent sites. These can be scattered throughout the sequence and can be
included or excluded from analysis as a group. If you have complicated patterns of sites you wish to analyze as
groups, and the domain/gene approach is unsuitable, you should assign a category to these sites, which can be
specified in addition to the groups and domains.

Setup/Select Taxa & Groups Dialog

This dialog box has two sub-windows (Taxa/Groups and Ungrouped Taxa), a panel bar between them containing a
few buttons, and a command panel, with the lower part containing the Add, Delete, Close, and Help buttons.

Taxa/Groups sub-window on the left: It shows all the currently defined taxa and group names hierarchically. If a
taxon has been assigned to a group, it will appear connected to that group. Groups may be displayed in a collapsed
format (indicated by a + mark before their name). You can click '+' to expand the group to a listing of the taxa
contained in it, and click '–' to collapse the group to only view the group name. Groups that do not contain any
members do not have this box. Next is a checkbox indicating whether a given group or taxon will be included in an
analysis. Following that is an icon indicating a taxon (single box) or a group (layer of boxes). Grayed out check boxes
are used to indicate that some of the taxa in a group are selected and others are unselected. You can rearrange the
order of taxa and groups using drag-and-drop. However, note that this order is not automatically used in the Data
Explorer. To enforce this order, use the Sort command in the Data Explorer.

Ungrouped Taxa sub-window on the right: It shows the names of all the taxa that do not belong to any group, which
makes it easier to move taxa into groups. If this sub-window does not appear on your screen, drag the lower right
corner of the dialog box to widen it until the sub-window is visible.

Middle Command Panel: This resides between the above-mentioned two sub-windows and contains a splitter on its
right edge. You can grab the splitter and move it to change the proportion of the space taken by the two sub-
windows. In this panel left and right arrow buttons are used to add or remove taxa from the groups. Clicking the
hand-with-a-pencil icon with a highlighted taxon or group name will allow you to edit that name.

Lower Command Panel: In the lower part of the Select/Edit Taxa/Groups window are buttons that are used to add
and/or delete groups. The '+' and '–' buttons are also present on the middle command panel.
Saving and Restoring Groups: You can save and restore the group to which each taxon is assigned, so that you do not
need to set up the groups each time. Normally you would just save the session (using session saving). However, if
you want to edit your data outside of MEGA, you need to save the data in a MEG file and use that file to restore the
groups.
Button     Description
Add        Creates a new group.
Delete     Deletes the currently selected group. Any taxa that were assigned to the group will become freestanding.
Ungroup    Makes all the taxa in the selected group freestanding, but does not remove the group from the list.
Close      Closes the dialog box.
Help       Brings up help regarding the dialog box.

How to perform functions:

Function                         Description
Creating a new group             Click on the Add button. Click on the highlighted name of the group and type in a
                                 new name.
Deleting a group                 Select the group and click the Delete button. Any taxa that were assigned to this
                                 group will become freestanding.
Adding taxa to a group           Drag-and-drop the taxon on the desired group, or select one or more taxa in
                                 the Ungrouped Taxa window and click on the left arrow button on the middle
                                 command panel.
Removing a taxon from a group    Click on the taxon and drag-and-drop it into a group (or outside all groups). Or,
                                 select the taxon and click on the right arrow button on the middle command panel.
Include/Exclude taxa or groups   Click the checkbox next to the group or taxon name.

Select Genetic Code Table Dialog

Use this dialog to select the desired genetic code and to edit and display the properties of the genetic codes. At present only
one genetic code can be selected in MEGA at any given time; it is used for all coding regions in all sequences in the
data set.

To select a genetic code, click in the square box to its left.


You can also highlight any genetic code by clicking on the text.
You can then use the following buttons found along the top of the dialog box:
Button       Description
Add          Creates a new genetic code table. A code table editor will be shown with the genetic code of the
             currently highlighted code table loaded.
Delete       Removes the highlighted genetic code from the list. Note that the standard genetic code cannot be
             deleted.
Edit         Modifies the highlighted genetic code or its name. The code table editor will be invoked for editing the
             genetic code.
View         Displays the highlighted genetic code in a printable format.
Statistics   Displays the number of synonymous and non-synonymous sites for the codons of the highlighted genetic
             code following the Nei-Gojobori (1986) method. The degeneracy values for the first, second, and
             third codon positions are displayed following Li et al. (1985).
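
As a rough illustration of what the Statistics button reports, the sketch below counts synonymous sites for a single codon in the spirit of the Nei-Gojobori (1986) method: for each codon position, it takes the fraction of the three possible single-base changes that leave the amino acid unchanged. The function name, the use of the standard genetic code, and the decision to count changes to stop codons as nonsynonymous are assumptions of this sketch; the manual does not state how MEGA treats stop codons.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
STANDARD_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_sites(codon, code=STANDARD_CODE):
    """Number of synonymous sites (0 to 3) for one sense codon.

    Assumption of this sketch: a change to a stop codon counts as
    nonsynonymous.
    """
    amino_acid = code[codon]
    if amino_acid == "*":
        raise ValueError("stop codons are not counted")
    total = 0.0
    for pos in range(3):
        synonymous = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if code[mutant] == amino_acid:
                synonymous += 1
        total += synonymous / 3  # fraction of synonymous changes here
    return total


def nonsynonymous_sites(codon, code=STANDARD_CODE):
    """Each codon has three sites in total."""
    return 3.0 - synonymous_sites(codon, code)
```

For the leucine codon TTA, one of three changes is synonymous at the first position, none at the second, and one at the third, so the sketch gives 2/3 synonymous sites and 7/3 nonsynonymous sites.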

Appendix C: Error Messages


Blank Names Are Not Permitted

As this error message suggests, you cannot leave the name of a sequence, taxon, domain, or gene blank.

Data File Parsing Error

An error occurred while parsing the input data file. Pay close attention to the message provided, then look for the
error that occurred just prior to the event indicated in the file.

Dayhoff/JTT Distance Could Not Be Computed

The Dayhoff/JTT matrix-based correction could not be applied for one or more pairs of sequences. If you wish to
know which pair(s), use the Distances|Pairwise option. They will be shown in the Distance Matrix Dialog with a red
n/c (not computable).

Domains Cannot Overlap

Any given site can belong to only one domain, at most. If you would like to assign a site or range of sites belonging
to one domain to a second domain, you must first change or delete the definition of the first domain.
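
The rule that a site belongs to at most one domain can be checked before a data file is loaded. The sketch below is only an illustration: it assumes each domain is given as a hypothetical (name, first_site, last_site) tuple with inclusive site numbers, which is not a MEGA data structure.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (name, first_site, last_site), inclusive.
Domain = Tuple[str, int, int]


def find_overlap(domains: List[Domain]) -> Optional[Tuple[str, str]]:
    """Return the names of one pair of overlapping domains, or None."""
    ordered = sorted(domains, key=lambda d: d[1])
    # After sorting by first site, any overlap shows up between neighbours.
    for previous, current in zip(ordered, ordered[1:]):
        if current[1] <= previous[2]:
            return (previous[0], current[0])
    return None
```

For example, `find_overlap([("Exon-1", 1, 100), ("Intron-1", 100, 200)])` reports the pair because site 100 is claimed by both domains; changing the intron to start at site 101 resolves the conflict.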

Equal Input Correction Failed

This error message means that the Equal Input Model-based correction could not be applied for the amino acid
distance estimation. If you wish to know which pair(s) of sequences has this problem, use
the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with a red n/c (not
computable).

Fisher's Exact Test Has Failed

Fisher's exact test uses estimates of the number of synonymous sites (S), the number of nonsynonymous sites (N), the
number of synonymous differences (Sd), and the number of nonsynonymous differences (Nd). It can fail for a number of
reasons. If the numbers are very large, some mathematical functions may not be able to handle them, although we
have tried to avoid this by using logarithms of factorials. To diagnose the problem, compute S, N, Sd, and Nd using
the Distances|Pairwise option four times. If you still cannot find the problem, please contact us.
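
To show why logarithms of factorials help with large numbers, here is a minimal sketch of a one-tailed Fisher's exact test computed in log space. It is not MEGA's implementation. It assumes that S, N, Sd, and Nd have already been rounded to whole numbers, that the 2x2 table is (changed, unchanged) by (synonymous, nonsynonymous), and that the tail of interest is an excess of nonsynonymous differences; the function names are hypothetical.

```python
import math


def log_factorial(n: int) -> float:
    """ln(n!) via the log-gamma function, which avoids overflow."""
    return math.lgamma(n + 1)


def log_choose(n: int, k: int) -> float:
    """ln of the binomial coefficient C(n, k)."""
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)


def fisher_one_tailed(S: int, N: int, Sd: int, Nd: int) -> float:
    """P(at least Nd nonsynonymous differences) with all margins fixed."""
    if min(S, N, Sd, Nd) < 0 or Sd > S or Nd > N:
        raise ValueError("need 0 <= Sd <= S and 0 <= Nd <= N")
    differences = Sd + Nd
    log_denominator = log_choose(S + N, differences)
    p_value = 0.0
    for k in range(Nd, min(N, differences) + 1):
        # Hypergeometric term: k nonsynonymous, the rest synonymous.
        log_term = (log_choose(N, k)
                    + log_choose(S, differences - k)
                    - log_denominator)
        p_value += math.exp(log_term)
    return min(p_value, 1.0)
```

Because every factorial is handled as a logarithm, a call such as `fisher_one_tailed(300, 700, 30, 70)` does not overflow even though 1000! is far outside the floating-point range.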

Gamma Distance Failed Because p > 0.99

For amino acid distance estimation, if the proportion of amino acids between two sequences that are different has
exceeded 99%, the gamma distance cannot be calculated. To know which pair(s) of sequences has this problem, use
the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with a red n/c.
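
A minimal sketch of this check, assuming the commonly used gamma distance for amino acids, d = a[(1 - p)^(-1/a) - 1], where a is the gamma shape parameter. The function name is hypothetical and the 0.99 cutoff is taken from the error message above.

```python
from typing import Optional


def gamma_distance(p: float, shape: float,
                   max_p: float = 0.99) -> Optional[float]:
    """Gamma distance for a proportion p of differing amino acids.

    Returns None (shown as n/c) when p exceeds the cutoff.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if shape <= 0.0:
        raise ValueError("the gamma shape parameter must be positive")
    if p > max_p:
        return None
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)
```

With a shape parameter of 1 and p = 0.5 the distance is 1.0; as p approaches 1 the term (1 - p)^(-1/a) grows without bound, which is why very divergent pairs cannot be given a distance.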

Gene Names Must Be Unique

MEGA requires that all gene names in a genome be unique, although, for convenience, many domains can have the
same name. For example, you may want to give the name Exon-1 to the first exon in all genes.

Inapplicable Computation Requested

You have requested a computation that is not allowed or is unavailable for the currently active dataset. If you think
that this is in error, then please report this potential software bug to us.

Incorrect Command Used

The selected command or option is not valid here. Please look at the brief description provided in the error message
window to determine the nature of the problem.

Invalid special symbol in molecular sequences

Unique ASCII characters, except letters and '*', can be used as special symbols for alignment gaps, missing data, and
identical sites. Frequently used symbols for identical sites, alignment gaps, and missing data are '.', '-', and '?',
respectively. This error message means that you have attempted to use the same symbols for two or more of these
types of sites, or a chosen symbol is not appropriate. For example, do not use N (the ambiguous site symbol for
DNA/RNA sequences), or X (the ambiguous site symbol for protein sequences) because they are already available as
the IUPAC symbols for molecular sequences.
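
The rules above can be checked before a file is loaded. The sketch below is an illustration only: the function and argument names are hypothetical, and it tests just the conditions stated in this section (one ASCII character each, no letters, no '*', all three different).

```python
from typing import List


def check_special_symbols(identical: str = ".", gap: str = "-",
                          missing: str = "?") -> List[str]:
    """Return a list of problems; an empty list means the symbols are valid."""
    symbols = {"identical": identical, "gap": gap, "missing": missing}
    problems = []
    for role, symbol in symbols.items():
        if len(symbol) != 1 or not symbol.isascii():
            problems.append(f"{role}: must be a single ASCII character")
        elif symbol.isalpha() or symbol == "*":
            # Letters (including N and X) and '*' are reserved.
            problems.append(f"{role}: {symbol!r} is a letter or '*'")
    if len(set(symbols.values())) < len(symbols):
        problems.append("the same symbol is used for two or more roles")
    return problems
```

Calling `check_special_symbols()` with the default symbols returns an empty list, while `check_special_symbols(gap="N")` and `check_special_symbols(missing="-")` each report a problem.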

Jukes-Cantor Distance Failed

The Jukes-Cantor correction is used to calculate nucleotide distances and synonymous and nonsynonymous substitution
distances. If the proportion of sites that are different (nucleotides,
synonymous, or nonsynonymous) is greater than or equal to 75%, the Jukes-Cantor correction cannot be applied. If
you see this error message, then this has happened for one or more pairs in your data. If you wish to know which
pair(s), use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with a red n/c.
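
The 75% limit follows from the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3): the argument of the logarithm reaches zero at p = 0.75. A minimal sketch, with a hypothetical function name, that returns None where MEGA would show n/c:

```python
import math
from typing import Optional


def jukes_cantor_distance(p: float) -> Optional[float]:
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if p >= 0.75:
        # 1 - 4p/3 is zero or negative, so the logarithm is undefined.
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```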

Kimura Distance Failed

The Kimura (1980) distance correction is used in a number of operations, including calculating nucleotide distances
and synonymous and nonsynonymous substitution distances. These formulas cannot be applied if the argument in the
logarithm approaches zero or becomes negative. If you see this error message, then this has happened for one or more
pairs in your data. If you wish to know which pair(s), use the Distances|Pairwise option. All such pairs will be
shown in the Distance Matrix Dialog with a red n/c.
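
For illustration, the Kimura two-parameter distance is usually written as d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch below uses a hypothetical function name and only rejects arguments that are zero or negative; how close to zero MEGA allows the arguments to get is not stated in this manual.

```python
import math
from typing import Optional


def kimura_2p_distance(P: float, Q: float) -> Optional[float]:
    """Kimura (1980) distance from transition (P) and transversion (Q)
    proportions. Returns None when a logarithm argument is not positive."""
    if P < 0.0 or Q < 0.0 or P + Q > 1.0:
        raise ValueError("P and Q must be proportions with P + Q <= 1")
    first = 1.0 - 2.0 * P - Q
    second = 1.0 - 2.0 * Q
    if first <= 0.0 or second <= 0.0:
        return None
    return -0.5 * math.log(first) - 0.25 * math.log(second)
```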

LogDet Distance Could Not Be Computed

The formula used for calculating distances contains many log terms. If some of their arguments approach zero too
closely or become negative, the LogDet correction cannot be applied. If you wish to know which pair(s) of sequences
has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with
a red n/c (not computable).

Missing data or invalid distances in the matrix

The selected set of taxa contains one or more pairs for which the evolutionary distance is either invalid or not
available. Please inspect the distance data in the Data Explorer to identify those pairs and remove one or more taxa,
as needed.

No Common Sites

For the sequences and data subset options selected, MEGA found zero common sites. If you selected the complete
deletion option then you might achieve better results using the pairwise deletion option, as complete
deletion removes all sites containing a gap in any part of the alignment. If you selected the pairwise deletion option
then MEGA was unable to calculate the distance between one or more of the sequence pairs in the alignment. To
identify such pairs compute a pairwise distance matrix using the p-distance method and look for the word “n/c” in
place of the pairwise distance value.
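
The difference between the two options can be sketched as follows. This is an illustration only: the function names are hypothetical, and '-' and '?' are assumed to be the gap and missing-data symbols.

```python
from typing import List


def complete_deletion_sites(alignment: List[str],
                            excluded: str = "-?") -> List[int]:
    """Indices of columns with no gap or missing data in any sequence."""
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if not any(symbol in excluded for symbol in column)
    ]


def pairwise_common_sites(first: str, second: str,
                          excluded: str = "-?") -> int:
    """Number of sites usable for one pair under pairwise deletion."""
    return sum(
        1
        for a, b in zip(first, second)
        if a not in excluded and b not in excluded
    )


alignment = ["AC--", "--GT", "A-G-"]
usable_everywhere = complete_deletion_sites(alignment)
usable_for_pair = pairwise_common_sites(alignment[0], alignment[2])
```

In this small alignment every column contains a gap, so complete deletion leaves no sites. Under pairwise deletion the first and third sequences still share one site, but the first and second share none, so that pair would be reported as n/c.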

Not Enough Groups Selected

The currently active dataset or subset does not contain enough groups to conduct the desired analysis. Please define or
select more groups using the Setup Taxa and Groups Dialog.

Not Enough Taxa Selected


The currently active dataset or subset does not contain enough sequences or taxa to conduct the desired
analysis. Please add or select more sequences.

Not Yet Implemented

The task you requested was not activated. This function either is not available in your release of MEGA or
needs to be activated by us. Please contact the authors and report this software bug at your earliest convenience.

p distance is found to be > 1

This peculiar situation can occur in the computation of the proportion of synonymous (or nonsynonymous)
substitutions per site, especially when the number of included codons is small. If you wish to know which pair(s) of
sequences has this problem, please use the Distances|Pairwise option. All such pairs will be shown in the Distance
Matrix Dialog with a red n/c.

Poisson Correction Failed because p > 0.99

For an amino acid estimation of distances, the proportion of amino acids that differ between two sequences has
exceeded 99% and the Poisson correction distance formula cannot be applied. If you wish to know which pair(s) of
sequences has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix
Dialog with a red n/c (not computable).
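
As an illustration, the Poisson correction distance is d = -ln(1 - p), which grows without bound as p approaches 1. The sketch below uses a hypothetical function name and takes the 0.99 cutoff from the error message above.

```python
import math
from typing import Optional


def poisson_correction_distance(p: float,
                                max_p: float = 0.99) -> Optional[float]:
    """Poisson correction distance for a proportion p of differing
    amino acids. Returns None (shown as n/c) when p exceeds the cutoff."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if p > max_p:
        return None
    return -math.log(1.0 - p)
```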

Tajima-Nei Distance Could Not Be Computed

For one or more pairs of sequences, the Tajima-Nei correction could not be applied, which usually occurs if the
argument in the log term of the formula becomes too close to zero. If you wish to know which pair(s) of sequences
has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with
a red n/c (not computable).

Tamura (1992) Distance Could Not Be Computed

For one or more pairs of sequences, the Tamura (1992) correction could not be applied. This usually occurs if the
argument in the log term of the formula becomes too close to zero or if it is negative, or if the G+C-content is 0% or
100%. If you wish to know which pair(s) of sequences has this problem, use the Distances|Pairwise option. All such
pairs will be shown in the Distance Matrix Dialog with a red n/c (not computable).
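
To see why a G+C content of 0% or 100% is a problem, consider the usual form of the Tamura (1992) distance, d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q), with h = 2θ(1 - θ) and θ the G+C content. When θ is 0 or 1, h is zero and P/h is undefined. The sketch below uses a hypothetical function name and is not MEGA's implementation.

```python
import math
from typing import Optional


def tamura_1992_distance(P: float, Q: float,
                         gc_content: float) -> Optional[float]:
    """Tamura (1992) distance from transition (P) and transversion (Q)
    proportions and the G+C content. Returns None when not computable."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    h = 2.0 * gc_content * (1.0 - gc_content)
    if h <= 0.0:
        return None  # G+C content is 0% or 100%
    first = 1.0 - P / h - Q
    second = 1.0 - 2.0 * Q
    if first <= 0.0 or second <= 0.0:
        return None  # a logarithm argument is zero or negative
    return -h * math.log(first) - 0.5 * (1.0 - h) * math.log(second)
```

With a G+C content of 50%, h equals 1/2 and the expression reduces to the Kimura (1980) two-parameter distance.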

Tamura-Nei Distance Could Not Be Computed

The Tamura-Nei distance formula contains many log terms. If some of their arguments approach zero too closely or
become negative, the Tamura-Nei model correction cannot be applied. If you wish to know which pair(s) of
sequences has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix
Dialog with a red n/c (not computable).

Unexpected Error

While carrying out the requested task, an unexpected error has occurred in MEGA. Please contact the authors
and report this software bug as soon as possible. We will try to solve the problem at the earliest possible time.

User Stopped Computation

You have aborted the current process by pressing the Stop process button on the progress indicator.
GLOSSARY
