How to Create a Phylogenetic Tree From DNA Sequences

A phylogenetic tree is a diagram that visually represents the evolutionary history and relationships among a group of organisms or genes. These diagrams illustrate the concept of common descent, showing how different biological entities have diverged from shared ancestors over time. Tracing these connections is fundamental to biology, allowing researchers to map the spread of traits, understand species diversification, and classify life based on ancestry. Creating such a tree is a rigorous, multi-step process that relies heavily on computational analysis of biological data.

Gathering the Data for Comparison

Modern evolutionary studies rely on molecular data, such as DNA or protein sequences, to construct these evolutionary diagrams. This approach offers a distinct advantage over older methods that depended solely on physical characteristics, or morphology, which can sometimes be misleading due to convergent evolution. The initial step involves selecting the specific genetic material that will be compared across all the organisms being studied.

Researchers must select sequences that are homologous, meaning they were inherited by all the compared species from a single common ancestor. For example, comparing highly conserved ribosomal RNA genes is preferred for deep evolutionary relationships, while mitochondrial DNA is better suited for comparing closely related species. Scientists must then obtain these sequences for every organism included in the analysis, often retrieving them from public databases like GenBank or through direct laboratory sequencing.

Preparing Sequences for Analysis

Before evolutionary calculations can begin, the collected DNA sequences must undergo a process called sequence alignment. Alignment ensures that the computer compares corresponding positions, or homologous sites, across all the sequences. Because insertions and deletions accumulate over evolutionary time, the raw sequences often differ in length and cannot simply be compared position by position.

Alignment software systematically shifts the sequences until the most similar characters (the same nucleotide bases in DNA, or the same amino acids in protein) are stacked in the same column. Where one sequence has no counterpart to a character in another, the alignment introduces a “gap,” which is represented by a dash. These gaps are biologically significant: each one represents an inferred insertion or deletion event, collectively called indels. Lining up the sequences correctly ensures that the subsequent statistical analysis quantifies differences and similarities between corresponding sites rather than between unrelated positions.
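To make the idea of shifting sequences and inserting gaps concrete, here is a minimal sketch of Needleman-Wunsch global alignment for two sequences. It is only an illustration: the scoring values (match, mismatch, gap) are arbitrary assumptions, and real studies align many sequences at once with dedicated multiple-alignment programs rather than with a pairwise routine like this one.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score).

    The scoring values are illustrative assumptions, not recommended settings.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                      # gap in b
                score[i][j - 1] + gap,                      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
```

The shorter sequence comes back padded with a single dash, which is the inferred indel. When several alignments tie for the best score, this sketch reports only one of them.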

Selecting the Tree Building Method

Once the sequences are aligned, scientists must choose a computational method to translate the genetic differences into a branching tree structure. There is no single, universally correct algorithm for tree construction, as different methods make varying assumptions about the evolutionary process. The choice of method often depends on the type of data, the number of sequences, and the time scale of the evolutionary relationships under investigation.

Distance Methods

Distance methods, such as Neighbor-Joining, simplify the data by first calculating a single numerical score that represents the overall genetic dissimilarity between every pair of sequences. The resulting matrix of pairwise distances is then used to join sequences step by step into a tree, which makes these methods fast and well suited to generating a preliminary tree, at the cost of discarding the site-by-site information in the alignment.
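The first step of a distance method, turning an alignment into pairwise scores, can be sketched as follows. This example assumes the sequences are already aligned to equal length, skips any column containing a gap, and applies the Jukes-Cantor correction as one simple choice of distance; the function names are illustrative, and the tree-joining step itself is not shown.

```python
import math


def p_distance(s1, s2):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped (an assumption;
    other gap treatments exist).
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(s1, s2):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        return float("inf")  # sequences are saturated; the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Return {(name_a, name_b): corrected distance} for every ordered pair."""
    names = list(alignment)
    return {
        (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
        for a in names
        for b in names
    }


# Toy alignment, invented purely for illustration.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
for pair, d in sorted(distance_matrix(toy).items()):
    print(pair, round(d, 4))
```

The correction makes the distance slightly larger than the raw proportion of differences, because some sites may have changed more than once without leaving a visible trace.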

Parsimony Method

The parsimony method operates on the principle that the simplest explanation is generally preferred. It searches for the tree structure that requires the fewest evolutionary changes, that is, the minimum number of nucleotide substitutions, to explain the observed differences among the sequences.
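Counting the minimum number of changes on one candidate tree can be illustrated with Fitch's algorithm, sketched below. The tree encoding (nested two-element tuples with taxon names at the tips) and the toy alignment are assumptions made for this example; a full parsimony analysis would also search across many candidate trees, which this sketch does not do.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one alignment column.

    `tree` is either a taxon name (tip) or a pair (left_subtree, right_subtree).
    `states` maps each taxon name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no extra change is needed here.
        return shared, left_changes + right_changes
    # The children disagree: at least one substitution is required.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Toy alignment, invented purely for illustration.
toy = {"A": "AAGT", "B": "AAGT", "C": "CTGT", "D": "CTGA"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_ab_cd, toy))  # 3
print(parsimony_score(tree_ac_bd, toy))  # 5
```

Under parsimony, the first tree is preferred for this toy data because it explains the same sequences with fewer substitutions.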

Model-Based Approaches

The most sophisticated methods are model-based approaches, including Maximum Likelihood and Bayesian methods. These techniques incorporate explicit statistical models that describe how DNA substitution occurs over time. Maximum Likelihood calculates the probability of the data given a particular tree and substitution model, and searches for the tree that makes this probability highest. The Bayesian method instead estimates the posterior probability of each tree given the data, combining the likelihood with prior assumptions. These methods are frequently considered the most robust for reconstructing complex evolutionary histories, although they are also the most computationally demanding.
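As a rough sketch of what "the probability of the data given a particular tree" means in practice, the example below evaluates the log-likelihood of a small alignment on a fixed tree using Felsenstein's pruning algorithm. Several things here are assumptions for illustration only: the Jukes-Cantor substitution model, the tree encoding, the `join` helper, the fixed branch lengths of 0.1, and the toy data. A real Maximum Likelihood program would also optimize the branch lengths and search over many tree shapes.

```python
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """Probability that base i becomes base j along a branch of length t
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partials(node, column):
    """Conditional likelihoods: P(tip data below node | node has base x).

    A tip is a taxon name; an internal node is a tuple of (child, branch_length)
    pairs. Characters outside ACGT (such as gaps) are treated as missing data.
    """
    if isinstance(node, str):
        observed = column[node]
        return {x: 1.0 if observed == x or observed not in BASES else 0.0
                for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, length in node:
        child_partials = partials(child, column)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, length) * child_partials[y]
                             for y in BASES)
    return result


def log_likelihood(tree, alignment):
    """Log-probability of the whole alignment given the tree (sites independent)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partials(tree, column)
        # Each base is equally likely at the root under Jukes-Cantor.
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total


def join(left, right, length=0.1):
    """Hypothetical helper: internal node whose two child branches share one length."""
    return ((left, length), (right, length))


# Toy alignment, invented purely for illustration.
toy = {"A": "AAGT", "B": "AAGT", "C": "CTGT", "D": "CTGA"}
tree_ab_cd = join(join("A", "B"), join("C", "D"))
tree_ac_bd = join(join("A", "C"), join("B", "D"))
print(log_likelihood(tree_ab_cd, toy))
print(log_likelihood(tree_ac_bd, toy))
```

The tree with the higher (less negative) log-likelihood is the one under which the observed sequences are more probable. A Bayesian analysis would build on the same likelihood calculation but combine it with prior probabilities to obtain posterior probabilities for trees.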

Reading and Validating the Completed Tree

The final phylogenetic tree is a map of evolutionary relationships, composed of several distinct parts that must be correctly interpreted. The lines on the diagram are called branches, and they represent the evolutionary paths taken by the organisms. The points where branches split are called nodes, and each node represents a hypothetical common ancestor from which the descendant lineages diverged.

The ends of the branches, known as tips, represent the species, populations, or genes being studied, usually sampled in the present day. To properly understand the direction of evolution, the tree must be rooted, which means identifying the position of the most recent common ancestor of the entire group. This is often accomplished by including an outgroup, a species known to have diverged before all the others in the study, so that the point where it joins the rest of the tree marks the root.

Since any phylogenetic tree is a statistical inference of history, scientists must assess the reliability of the branching pattern using statistical support measures. The most common method is bootstrapping, which generates hundreds or thousands of resampled data sets by randomly drawing columns from the original alignment with replacement, so each resampled alignment has the same length as the original. The computer then builds a tree for each resampled data set. The resulting bootstrap value, typically displayed as a percentage on a branch, indicates what proportion of the resampled trees contained that specific grouping. A higher bootstrap value, such as 95% or more, indicates that the grouping is strongly and consistently supported by the data, which increases confidence in that part of the tree.
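The resampling procedure can be sketched in a few lines. In this example the column resampling is the actual bootstrap step, while `closest_pair_clade` is only a stand-in for a real tree-building method: it reports the single pair of sequences with the fewest differences, so that the example stays short. All function names, the fixed random seed, and the toy alignment are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def resample_columns(alignment, rng):
    """One bootstrap replicate: draw columns with replacement until the
    replicate has the same number of columns as the original alignment."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def closest_pair_clade(alignment):
    """Stand-in for a tree builder: return the one grouping formed by the two
    sequences with the fewest differences (ties broken by name order)."""
    def differences(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    pair = min(combinations(sorted(alignment), 2),
               key=lambda p: differences(*p))
    return {frozenset(pair)}


def bootstrap_support(alignment, build_clades, replicates=1000, seed=0):
    """Percentage of replicates in which each grouping was recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_clades(resample_columns(alignment, rng)))
    return {clade: 100.0 * n / replicates for clade, n in counts.items()}


# Toy alignment, invented purely for illustration.
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAC",
    "C": "TGCATGCATG",
    "D": "CATGCATGCA",
}
for clade, support in bootstrap_support(toy, closest_pair_clade, replicates=200).items():
    print(sorted(clade), support)
```

In a real analysis, `build_clades` would be replaced by whichever tree-building method was used on the original alignment, and the support percentages would be reported for every grouping in that tree.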