Jeffrey Thorne

Professor Emeritus

919-515-1946 thorne@ncsu.edu

Bio

Education

Ph.D., Genetics, University of Washington, 1991

Area(s) of Expertise

Bioinformatics
Genetics

Publications

A deep-learning-based score to evaluate multiple sequence alignments, bioRxiv (Cold Spring Harbor Laboratory) (2026)
Using drift coefficients as a basis for inferring times, effective population sizes, and genetic adaptations, Molecular Biology and Evolution (2026)
Likelihood-based evaluation of character recoding schemes for phylogenetic analysis, bioRxiv (Cold Spring Harbor Laboratory) (2025)
Interlocus Gene Conversion, Natural Selection, and Paralog Homogenization, Molecular Biology and Evolution (2023)
Scalable Bayesian Divergence Time Estimation With Ratio Transformations, Systematic Biology (2023)
Convergent evolution of polyploid genomes from across the eukaryotic tree of life, G3 Genes Genomes Genetics (2022)
Correlations between alignment gaps and nucleotide substitution or amino acid replacement, Proceedings of the National Academy of Sciences (2022)
Exome sequencing of hepatocellular carcinoma in lemurs identifies potential cancer drivers, Evolution Medicine and Public Health (2022)
Measuring Phylogenetic Information of Incomplete Sequence Data, Systematic Biology (2021)
Pedigree-based and phylogenetic methods support surprising patterns of mutation rate and spectrum in the gray mouse lemur, Heredity (2021)

Grants

Date: 08/01/18 - 7/31/24

Amount: $564,338.00

Funding Agencies: National Science Foundation (NSF)

A variety of mutational mechanisms generate repeated sequences. Following their formation, the evolutionary fates of individual repeated elements are intertwined. Interlocus gene conversion (IGC) is one source of this dependence and motivates the proposed research. Available techniques for inferring phylogenies, estimating divergence times, and studying molecular evolution largely ignore IGC. Following a duplication, the resulting paralogs are conventionally assumed to experience the same speciation events but otherwise to evolve independently. This ignores the influence that the sequence of one paralog can have on the other. If two paralogs have different nucleotide types at corresponding sequence positions, IGC can homogenize those positions by introducing the type from one paralog into the other. Due to IGC, a multigene family that arises via duplication events can consist of paralogous sequences that are more similar than would be the case if paralogs changed independently. Failure to account for IGC is therefore potentially problematic for phylogeny inference, divergence time estimation, and attempts to characterize the process of molecular evolution. We are developing inferential procedures that accommodate IGC.

Date: 04/01/18 - 3/31/22

Amount: $524,897.00

Funding Agencies: National Science Foundation (NSF)

The use of molecular markers of self-identity as a basis for immunity marks a major evolutionary innovation in the early history of vertebrates. It is well established that self versus non-self recognition has spurred a co-evolutionary competition between vertebrate hosts and pathogens, driving high levels of both inter- and intra-specific immune gene sequence diversity. Although immune gene diversification is likely essential for a species to survive new pathogens, the origin and evolutionary dynamics of vertebrate self versus non-self recognition remain poorly understood. As a group, ray-finned fish (Actinopterygii) constitute over half of the extant vertebrates on earth and display greater species diversity than any other group of vertebrates, making them a powerful system for understanding the genetic and functional evolution of immune genes. Fish not only share certain immune gene families with mammals, but also encode a number of "fish-specific" immune gene families. This project will integrate new transcriptome and genomic sequence data from multiple early diverging lineages of ray-finned fishes with established sequence data from other fishes and a phylogenetic comparative framework to 1) establish the evolutionary origins of fish-specific immune receptors, 2) determine if genomic organization influences rates of immune gene family evolution, and 3) define co-evolutionary relationships between markers of "self" and their candidate receptors. This study will not only provide a perspective on the early history of the vertebrate immune system, but will also reveal novel molecular innovations in pathogen resistance in vertebrates.

Date: 07/01/16 - 4/30/21

Amount: $1,140,158.00

Funding Agencies: National Institutes of Health (NIH)

This proposal uses a combination of atomistic simulations, single-molecule FRET experiments, and statistical analysis to analyze atypical structures associated with trinucleotide repeat diseases. Given the high mutation rate of these atypical structures, their implications for evolution will be explored.

Date: 12/01/16 - 8/31/20

Amount: $62,714.00

Funding Agencies: Morris Animal Foundation

Killer T cells, which fight pathogens by destroying infected cells, recognize their targets through signals delivered by surface Major Histocompatibility Complex (MHC) molecules. To reduce pathogen vulnerability, the MHC system maintains high genetic diversity, i.e., many alleles in the gene pool. Studying T-cell responses across individuals thus requires finding highly shared alleles. We were the first group to define the 3 MHC genes in cats, designated Feline Leukocyte Antigen (FLA)-E, -H and -K. Here, we propose to clarify the relationship of FLA alleles and their genes of origin, and to catalog prevalent alleles. These objectives will be accomplished by a next-generation FLA DNA sequencing-based survey of DNA collected from 100-200 unrelated, mixed-breed cats from around the world. This data set will be used to populate a currently-empty feline database in the public library of comparative MHC diversity maintained by the European Bioinformatics Institute (https://www.ebi.ac.uk/ipd/mhc/). The evolutionary relationship of the different FLA class I alleles will also be investigated in this study.

Date: 08/01/10 - 7/31/15

Amount: $1,044,720.00

Funding Agencies: National Institutes of Health (NIH)

Because evolution occurs within populations, evolutionary inference is ideally framed with respect to population genetic parameters. Although elaborate statistical techniques exist for analyzing interspecific molecular sequence data, a population genetic basis for these techniques is often absent or unclear. This is unfortunate because interspecific comparisons are the only way to study most of evolutionary history. This project aims to make population genetic inferences from interspecific DNA sequence data. The focus is on the relative fitnesses conferred by protein-coding DNA sequences. A central quest of evolutionary study is to understand how phenotype affects survival of the genotype. Much biological research involves connecting genotype to phenotype, but evolutionary biology has an even more ambitious goal. It is the task of evolutionary biologists to connect genotype to fitness. The difficulty of this task is lessened when phenotype can be accurately predicted from genotype because the genotype-to-fitness mapping problem then becomes the slightly less intimidating phenotype-to-fitness mapping problem. We convert the former mapping problem to the latter by exploiting automated systems that make in silico predictions about phenotype from gene sequence data. These systems for predicting phenotype solely from genotype can never be perfect because phenotype depends on both genotype and environment. However, systems for predicting some aspects of phenotype are reliable enough to be useful and they have not been sufficiently exploited by evolutionary biologists. The statistical techniques that we are developing can be employed to examine the evolutionary consequences of any phenotype for which an in silico prediction system exists, but we are concentrating on the evolutionary role of protein tertiary structure. Our research stems from earlier work that we have done on assessing the evolutionary impact of protein tertiary structure from interspecific sequence data. 
We have now extended this work to a population genetic framework. We will continue our investigations via three interconnected lines of research: (1) More realistic descriptions of the evolutionary process: We will improve our treatment of protein tertiary structure and the natural selection that acts upon it. Sequences from the protein family of interest, empirically-derived information from a database of known tertiary structures, and covariates such as gene expression information will all influence our evolutionary models. We will also make our descriptions of the mutation process more realistic by having mutation rates depend on local sequence context and vary among genomic regions. The more realistic models that result will yield improved relative fitness estimates. These models will also provide the basis for more accurate inference of ancestral sequences and for characterization of the adaptive landscape. (2) Improved inference of population genetic parameters from interspecific data: Our current technique for making population genetic inferences from interspecific data requires a variety of restrictive assumptions, including the assumption that mutation rates are so low that the possibility of multiple fitness-affecting polymorphisms simultaneously segregating in a population can be neglected. We intend to relax these assumptions as well as to explore their consequences via simulation. (3) Predicting health-related effects of nonsynonymous variation in humans: Because our interspecific models are framed in terms of population genetics, we can combine the interspecific and intraspecific data in a sensible way. PI Stone has developed a successful approach for predicting which nonsynonymous variation has health-related effects. Our carefully obtained estimates of the fitness consequences of genetic variation will allow further improvement of this already successful approach.

Date: 09/30/09 - 7/31/15

Amount: $361,092.00

Funding Agencies: National Institutes of Health (NIH)

This project will simultaneously consider protein tertiary structure information and protein sequence information to improve statistical methods for proteins. The focus will be on sequence alignment (the correspondence between positions in different protein sequences) and how sequence alignment can be probabilistically incorporated into the inference of evolutionary trees.

Date: 09/01/10 - 8/31/14

Amount: $600,000.00

Funding Agencies: National Science Foundation (NSF)

This is a theoretical proposal using techniques of bioinformatics and classical molecular dynamics to investigate the folding and evolution of select protein systems subject to point mutations leading to the formation of "bridge" or "interface" structures.

Date: 07/01/12 - 6/30/13

Amount: $25,000.00

Funding Agencies: NCSU Research and Innovation Seed Funding Program

The Wyoming toad is a highly endangered species maintained in the wild only through annual release of captive-bred individuals. This species is highly susceptible to chytridiomycosis, a fungal infection implicated in the global decline of amphibian populations. Disease is identified as a major threat to survival of the species. Knowledge of the genes or genome of this endangered species is needed to guide recovery management and to increase our understanding of amphibian mechanisms of resistance to infectious disease. The proposed research will generate and analyze transcriptome sequences from captive Wyoming toads, isolated as three distinct populations. Using these data, we will identify new polymorphisms for assessing the genetic diversity of this species and generate an inventory of immune-related genes within this species. The resulting transcriptome sequences will provide the foundation for future studies on evaluating the genetic diversity of the extant toads, identifying biomarkers for chytridiomycosis, and possibly identifying individual toads resistant to chytridiomycosis for release into the wild.

Date: 08/01/05 - 7/31/10

Amount: $634,795.00

Funding Agencies: National Institutes of Health (NIH)

Protein tertiary structure changes slowly during evolution. Nucleotide substitution rates are expected to be low if they result in an amino acid replacement that disrupts protein structure. The effect of an amino acid replacement on tertiary structure is determined not only by the residues involved in the replacement but also by the residues that are spatially near the site that experiences the replacement. This relationship between protein structure and protein change induces an evolutionary dependence among the positions in the protein-coding genes. Unfortunately, widely used models for the evolution of protein-coding genes ignore this dependence among positions due to protein structure. The research in this project will build upon a newly developed statistical technique for making evolutionary inferences from sequence pairs. This new technique incorporates dependence among codons due to pairwise amino acid interactions that are imposed by the protein tertiary structure. The initial focus will be to extend this model-based approach to the analysis of more than two phylogenetically-related sequences. The resulting method will be a powerful tool for characterizing the impact of protein structure on protein evolution. The Pandit database of aligned protein-coding DNA sequences will be mined to assess which protein families evolve under the most and least influence of tertiary structure. Evidence of positive selection in this database will be identified and the issue of whether the strength of the relationship between protein structure and protein evolution varies among taxonomic groups will be addressed. To complement the empirical studies and to further evaluate the new methodology, simulations will be performed. These simulations will illuminate the statistical properties of the new methods for making evolutionary inferences when sequence positions do not change independently.
Although the emphasis of this project is evolutionary dependence among codons due to protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modelled.

Date: 04/01/05 - 3/31/09

Amount: $150,000.00

Funding Agencies: National Science Foundation (NSF)

This project will improve Bayesian methods for estimating evolutionary rates and divergence times. We will focus on improving the treatment of fossil data, models of sequence change, and duplications of genes or gene regions.
