Introduction

Chilis and sweet (bell) peppers or paprika are members of the genus Capsicum (Solanaceae), which is thought to contain at least 35 species and which originated in Central and South America [1]. Certain plants of this genus have global importance as vegetables, medicines, and ornamentals. In particular, C. chinense Jacq., C. baccatum var. pendulum L., C. pubescens Ruiz & Pav., C. annuum L. and C. frutescens L. have high economic value and are widely cultivated and traded globally [2, 3]. Archeological microfossil evidence has indicated that peppers were domesticated in the Americas and have been consumed in this region for more than 7,000 years [4]. Pepper was introduced into China during the late 16th century [5]. Owing to the complex geographical environment and climatic conditions, abundant pepper germplasm resources have evolved or been shaped in China.

The investigation of the genetic diversity present in a group can help to reconstruct the origins and evolution of the species within that group, and can be invaluable in genetic breeding programs [6]. Because the Capsicum group is so large, understanding its genetic diversity is key to exploiting its genetic resources fully. The genetic diversity, population structure and phylogenetic relationships within C. annuum have been investigated in previous studies using SSR markers [7,8,9], morphological, physiological and yield traits [10, 11], and SNPs [12, 13]. Other studies have focused on the genetic variation within C. frutescens [2, 14,15,16], C. pubescens [18] and C. chinense [17,18,19] using morphological, physiological and molecular markers. The use of different markers for each species makes comparison of genetic diversity across species impossible. Moreover, most of the molecular markers used in these studies were derived from nuclear genes, and to date, the chloroplast (cp.) genome has only rarely been used to study Capsicum genetics.

The cp. genome, which is inherited from the maternal parent, is a circular, double-stranded DNA molecule [20, 21]. It is small, with a low molecular weight, and ranges in size between 120 and 160 kb [31]. The structure of the cp. genome is relatively simple and quadripartite, with large (LSC) and small (SSC) single-copy regions often separated by two inverted repeat (IR) regions [22,23,24,25]. In angiosperms, between 110 and 130 genes are normally present in the cp. genome [26].

The genes of the cp. genome are related to biosynthesis, transcription/translation and photosynthesis [27]. Although the sequence and gene content of the plant cp. genome are highly conserved [25], gene loss, mutation, and pseudogenization have often occurred during its evolutionary history [28]. Cp. genomes have been proposed as “DNA super barcodes” by some previous studies [20]; they are often used in species identification and analyses of genetic diversity [27, 29, 30], and have been widely used in phylogenetic, taxonomic and evolutionary studies [31]. Since the cp. genome of Nicotiana tabacum was first assembled [32], and following the rise of sequencing technology, the cp. genomes of many plant species have been sequenced. The cp. genomes of several Capsicum species, including C. annuum var. annuum [33], Capsicum annuum var. glabriusculum [34], C. frutescens [35], C. chinense [36], C. baccatum [37], C. eximium [38], and C. eximium [43] have also been sequenced in previous studies. However, to date, there has been no comparison of genetic diversity among the different Capsicum species based on the cp. genome. In this study, we re-sequenced and assembled the cp. genomes of 32 samples of Capsicum landraces and varieties. We then analyzed these genomes for GC content, number of genes, number of repeat sequences, codon usage bias and number of simple sequence repeats (SSRs). The gene sequences of the IR regions and gene differentiation within different Capsicum species were compared. The sequences generated in this study were then combined with those of other Solanaceous species downloaded from NCBI, and the evolutionary relationships between Capsicum and other species of Solanaceae, as inferred from the cp. genomes, were discussed.

In summary, this study aimed to: (a) characterize the cp. genomes of diverse Capsicum varieties; (b) explore differentiation in the cp. genomes of representative Capsicum varieties; and (c) reveal the phylogenetic relationships among Capsicum varieties and representative species of Solanaceae based on the complete cp. genome. These results will further our knowledge of the evolutionary origins of the Capsicum genus, and will enable future researchers to more fully utilize the germplasm resources available for this diverse and globally important group.

Materials and methods

Plant materials, DNA extraction and whole genome resequencing

We sampled a total of 32 pepper samples from three Capsicum species (C. annuum, C. chinense, and C. frutescens) taken from six countries. Samples had been collected by our project team on previous expeditions (Table 1). Seeds from these samples were planted in a greenhouse and the fresh leaves were harvested from the young plants when they were 20 days old. A voucher specimen of C. annuum has been deposited in the herbarium of the Kunming Institute of Botany, Chinese Academy of Sciences (KUN 1395981).

Table 1 Sample information for the 32 Capsicum samples

The harvested fresh leaves were stored in a deep freezer at -80 °C until needed. Total DNA was then extracted from the leaf material using a CTAB method as previously described [39]. The DNA was examined using agarose gel electrophoresis (Omega Bio-Tek, Norcross, GA, United States), and the DNA in each sample was quantified and assessed for quality using a fluorometer (Qubit 3.0, Thermo Fisher Scientific, Waltham, MA, United States). High-quality samples were then standardized to 10 µl and 200 µg DNA.

The genomic DNA was fragmented, and the fragmented DNA (insert sizes ~450 base pairs (bp)) was used to construct libraries, which were sequenced on an Illumina NovaSeq 6000 platform to generate 150 bp paired-end reads. Low-quality reads (those in which 50% or more of the bases had a quality score < 10) were filtered out of the raw sequencing data using fastp 0.21.0. The average depth of coverage was 5×. The clean data were used in the subsequent analyses.
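The read-filtering rule stated above can be illustrated with a minimal Python sketch. It is not the fastp implementation; the function names, the Phred+33 quality encoding and the toy reads are assumptions made purely for illustration.

```python
def is_low_quality(quality_string, min_q=10, max_fraction=0.5, phred_offset=33):
    """Return True if at least `max_fraction` of the bases have quality < `min_q`.

    Assumes Phred+33 encoded quality strings (an assumption, not stated in the text).
    """
    if not quality_string:
        return True  # treat empty reads as unusable
    low = sum(1 for ch in quality_string if ord(ch) - phred_offset < min_q)
    return low / len(quality_string) >= max_fraction


def filter_reads(records):
    """Keep (name, sequence, quality) records that pass the quality rule."""
    return [rec for rec in records if not is_low_quality(rec[2])]


# Toy records: '#' encodes Q2 and 'I' encodes Q40 under the Phred+33 scheme.
reads = [
    ("read1", "ACGT", "IIII"),  # no low-quality bases: kept
    ("read2", "ACGT", "##II"),  # 50% low-quality bases: removed
    ("read3", "ACGT", "#III"),  # 25% low-quality bases: kept
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
```

In practice the filtering would be run on paired FASTQ files, and a pair would normally be discarded if either mate fails the rule.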

Chloroplast genome assembly, gene annotation and sequence analysis

The GetOrganelle pipeline (https://github.com/Kinggerm/GetOrganelle) was used for assembly of the clean reads, and the contigs were checked against a C. annuum reference genome (NC028007) using BLAST (https://blast.ncbi.nlm.nih.gov/). The reference genome was then used to position and align the contigs. We checked for possible misassembled regions by mapping the raw reads onto the final contig and observing the coverage patterns. We then used the CpGAVAS pipeline [40] to automatically annotate the genome, and Geneious 8.1 [41] to identify the start/stop codons and intron/exon boundaries. tRNAscan-SE v2.0 [42] was then employed to identify regions encoding tRNAs, and we used OGDraw v1.2 (http://ogdraw.mpimp-golm.mpg.de/) [43] to draw a physical map of the cp. genome.

Analysis of the features of the chloroplast genome

Repetitive sequences of 16 bp or longer in the Capsicum cp. genome, including palindromic, forward, complement and reverse repeats, were identified using REPuter [44], with a minimum alignment score and maximum period size of 500. SSR markers were identified using Phobos v3.3.12 [45] and SSRHunter [46], which can identify multinucleotide repeats with four or more copies and with motif lengths of between 2 and 6 bp. Codon usage analysis and calculation of relative synonymous codon usage (RSCU) were conducted using the MEGA v11 software [47].
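The SSR criteria given above (motifs of 2 to 6 bp repeated four or more times) can be sketched with a regular expression. This is only an illustration of the criteria for perfect repeats; it does not reproduce the algorithms of Phobos or SSRHunter, and the function name and toy sequence are hypothetical.

```python
import re


def find_ssrs(sequence, min_motif=2, max_motif=6, min_copies=4):
    """Return (start, motif, copies) for perfect tandem repeats in `sequence`.

    The lazy quantifier makes the search prefer the shortest motif that
    satisfies the copy-number threshold at each position.
    """
    seq = sequence.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA read as (AA)n
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Hypothetical toy sequence containing (AT)5 and (AAG)4.
toy = "GGC" + "AT" * 5 + "CCT" + "AAG" * 4 + "T"
print(find_ssrs(toy))
```

Start positions are 0-based. Imperfect or compound repeats, which dedicated SSR tools can report, are not detected by this sketch.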

Chloroplast genome genetic diversity analyses of Capsicum individuals

An alignment of cp. genome sequence data that could be used in DNAsp analysis was produced in MAFFT V7.471 (Kazutaka Katoh, Japan) [48]. We then used DNAsp to identify the single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels) present [49]. This analysis generated genotype data files and calculated the haplotype diversity (Hd). We then conducted a sliding window analysis using DNAsp [49] with a window length of 100 bp and a step size of 25 bp. Genetic relationships between the different genotypes present in our analyses were investigated through a genotype network analysis in NETWORK v10200 [50]. All the above analyses included all the indels identified in the aligned sequences.
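The sliding window calculation can be sketched as follows. This is a simplified illustration rather than the DNAsp procedure: nucleotide diversity is taken here as the mean proportion of differing sites over all sequence pairs, and sites where either sequence has a gap or ambiguous base are skipped, which is an assumption about gap handling. Function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    valid = set("ACGT")
    per_pair = []
    for a, b in combinations(seqs, 2):
        compared = diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in valid and y in valid:  # ignore gaps and ambiguous bases
                compared += 1
                diffs += x != y
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair) if per_pair else 0.0


def sliding_window_pi(alignment, window=100, step=25):
    """Return (window midpoint, diversity) tuples along the alignment."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in alignment]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


# Toy alignment; a small window and step are used so the result is easy to check.
toy_alignment = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"]
for midpoint, pi in sliding_window_pi(toy_alignment, window=4, step=2):
    print(midpoint, round(pi, 3))
```

With the defaults (window=100, step=25) the function mirrors the window settings described above; the per-window values could then be plotted against the window midpoints.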

Comparison of the genome within the genus Capsicum

The borders between the IR and the SSC and LSC regions may change over evolutionary time, which may also result in size differences between the cp. genomes of different species or varieties [57,58,59]. We used IRscope [51] to investigate the IR boundaries in the cp. genomes of the three species sampled in our study (C. annuum, C. frutescens and C. chinense) together with those of six other Capsicum species downloaded from NCBI (C. lycianthoides, C. chacoense, C. eximium, C. galapagoense, C. pubescens and C. tovarii). We used the online software mVISTA [52] with the Shuffle-LAGAN alignment model [53], and with C. annuum as a reference genome, to investigate differences among the cp. genomes of these nine Capsicum species.

Phylogenetic reconstruction and population structure analysis

We downloaded the cp. genome sequences of 40 species in the Solanaceae, as well as that of an outgroup (Helianthus annuus, Asteraceae), from GenBank. We then aligned these sequences, together with those of the three species sequenced in our study, using MAFFT [48]. We reconstructed a maximum likelihood (ML) phylogenetic tree from this alignment using MEGA v11 [47], with 1000 bootstrap replicates, in order to investigate the phylogenetic placement of Capsicum within the Solanaceae. The nucleotide substitution model GTR + G + I was selected using MEGA v11. Initial trees for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with the superior log likelihood value.

Results

Genetic diversity analyses of the Capsicum chloroplast genome based on 32 varieties

In this study, we sequenced and assembled the cp. genomes of 32 Capsicum varieties. A total of 13 genotypes were resolved in the 32 varieties, and each genotype was submitted to GenBank (Table 2). A total of 608 indels and 83 SNPs were found in the genomes of the 32 Capsicum samples. Of these 83 SNPs, 8 were singleton variable sites and 75 were parsimony-informative sites, and the nucleotide diversity was 0.424. Most (67) of the 83 SNPs distinguished the pungent group from the non-pungent group, with only 8 and 11 polymorphic sites within the pungent group and the non-pungent group, respectively (Table 3). There was considerable genetic variation between the genomes of C. chinense and C. frutescens (the pungent varieties) and those of C. annuum (the non-pungent varieties). In the pungent group, we found 4 genotypes and 8 polymorphic sites, and the nucleotide diversity was 0.070. Genotype 10 was the most common genotype in this group, and contained seven varieties (Genotype 10: CXJ4, CXJ6, CXJ7, CXJ8, CXJ74, CZM8, CZM9). In the non-pungent group, we found 9 genotypes and 11 polymorphic sites, and the nucleotide diversity was 0.042. Genotype 7 was the most common and contained seven varieties (Genotype 7: CS13, CS14, CS17, LS5, TJ05, TJ06, TJ08) (Table 2). Because only 13 genotypes were found among the 32 varieties, and these genotypes (Genotypes 1–13) fell into two lineages, we chose one representative variety from each lineage (CXJ4 for C. frutescens and CS13 for C. annuum) for later characterization of the cp. genome.

Table 2 Genotypes of the 32 Capsicum varieties
Table 3 Polymorphic sites and nucleotide diversity in Capsicum varieties

We constructed an ML tree using 13 varieties, each of which represented one of the genotypes identified above. We found that different varieties of C. annuum, C. chinense and C. frutescens clustered together with others of the same species, forming monophyletic groups for each species. The genetic relationship between the two pungent species, C. chinense and C. frutescens, was close (Fig. 1A). Similar relationships were revealed in the network analyses. We identified nine genotypes (G1-G9) from C. annuum, three genotypes (G10-G12) from C. frutescens, and one (G13) from C. chinense (Fig. 1B).

Fig. 1

Genetic relationships among the 13 genotypes. (A) Reconstructed maximum likelihood (ML) phylogenetic tree based on the chloroplast genome sequences of different varieties of Capsicum. Numbers to the right of nodes are bootstrap support values. (B) Network map showing the genetic relationships among the 13 genotypes

Features of the Capsicum chloroplast genome

The 13 representative Capsicum cp. genomes identified in this study ranged in length from 156,729 (CS13) to 156,950 bp (CXJ4), which was very similar to the lengths of most previously published cp. genomes (Table 4). We found that each of the 13 representative Capsicum cp. genomes formed a single, circular DNA sequence (Fig. 2). All showed a classical quadripartite structure, and contained two copies of the IR, between 25,748 (CS13) and 25,847 bp (CXJ4) in length. The LSC was between 87,344 (CXJ4) and 87,380 bp (CS13) long, and the SSC between 17,853 (CS13) and 17,912 bp (CXJ4) (Fig. 2; Table 1). The 13 representative Capsicum cp. genomes analyzed all had a GC content of 37.7% (Table 4). In each of the analyzed Capsicum cp. genomes, we found 132 genes, of which 8 encoded ribosomal RNAs (rRNAs), 37 encoded transfer RNAs (tRNAs), and 87 were protein-coding genes (PCGs) (Fig. 2). We were able to assign each gene to one of four functional categories: photosynthesis (47 genes), self-replication (73), biosynthesis (5) or unknown function (6) (Fig. 2; Table 5). Nineteen genes in the IR were found to be either partially or fully duplicated. These included eight PCGs (ndhB, rps7, rps12, rpl2, rpl23, ycf1, ycf2 and ycf15), seven genes encoding tRNAs (trnA-TGC, trnI-CAT, trnI-GAT, trnL-CAA, trnN-GTT, trnR-ACG and trnV-GAC), and the four rRNA genes (4.5 S, 5 S, 16 S and 23 S) (Table 5). A total of 113 unique genes were found in the Capsicum cp. genome (79 PCGs, 30 tRNA genes and 4 rRNA genes). The structure of the cp. genome is likely to be highly conserved in Capsicum, because the structural elements that we observed were identical in all of the 13 Capsicum varieties analyzed.
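Because the quadripartite genome consists of the LSC, the SSC and two copies of the IR, the reported region lengths can be checked against the total genome length. The short Python sketch below performs this check for CS13 using the values reported above, and also shows how a GC content percentage might be computed; the function names and the toy sequence are hypothetical.

```python
def quadripartite_length(lsc, ssc, ir):
    """Total length of a quadripartite genome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


def gc_content(sequence):
    """GC content (%) over unambiguous bases only."""
    counted = [b for b in sequence.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for b in counted if b in "GC")
    return 100.0 * gc / len(counted)


# CS13 region lengths (bp) as reported in the text.
cs13_total = quadripartite_length(lsc=87_380, ssc=17_853, ir=25_748)
print(cs13_total)                # 156729, matching the reported CS13 genome length
print(gc_content("GGCCAATTAT"))  # toy sequence, 4 of 10 bases are G or C
```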

Table 4 Information on the complete cp. genomes of the 13 representative Capsicum varieties
Fig. 2

Gene map of the Capsicum annuum (CS13) chloroplast genome. Note: the inner circle shows GC content, and the outer circle shows genes

Table 5 Genes present in the Capsicum chloroplast genome

Analysis of SSRs, repeat sequences and codon usage bias in the Capsicum chloroplast genome

The two most common genotypes among the Capsicum individuals sequenced in this study were those represented by CS13 and CXJ4. We therefore selected these two genotypes for the following analyses of SSRs, repeats and codon usage bias. CS13 and CXJ4 were found to be similar in the number and composition of SSRs, with each containing only 47 identified SSRs (Fig. 3A). We found 16 SSRs of type AT, and 10 of type TA. Most of the SSRs were dinucleotides (91.48% of the total SSRs), with the remainder being trinucleotides (0.64%) or tetranucleotides (0.21%). Twenty-three SSRs were found in the LSC region (Fig. 3A), ten in the IR regions, and four in the SSC. There was an A/T nucleotide bias in the Capsicum cp. SSRs, with A/T repeats making up 61.7%.

Fig. 3

Types and distributions of repeat sequences and simple sequence repeats (SSRs) in Capsicum chloroplast genomes. (A) Proportion of SSRs in Capsicum cp. genomes (CS17 and CXJ4). (B) Numbers of different types of repeat sequences in the Capsicum chloroplast genomes (CS17 and CXJ4)

We then investigated the presence of repetitive sequences in the Capsicum cp. genome. The varieties with the CS13 genotype contained 281 repetitive sequences, all between 16 and 76 bp in length. These comprised 109 forward repeats, 80 palindromic repeats, 61 reverse repeats and 31 complement repeats. The CXJ4 varieties were found to contain 306 repetitive sequences, also ranging in length between 16 and 76 bp. These comprised 122 forward repeats, 93 palindromic repeats, 63 reverse repeats and 28 complement repeats. The numbers of the different types of repeats are given in Fig. 3B.

We next analyzed codon usage in the protein-coding genes. A total of 40 codons were found with an RSCU > 1.0, with AUU (4.14%), AAA (3.96%), GGA (3.84%), AAU (3.76%) and UUU (3.63%) being the most commonly used codons. Leu (L) was found to be the most common amino acid in the cp. genome (3234 times), followed by Ser (S) and Ile (I), both of which were found > 2000 times. Trp (W) and Met (M) were the least commonly used amino acids, and occurred 485 and 620 times, respectively (Table 6). Codon preference analysis showed that codons ending in A or T (U) at the 3’ position were preferred, and that their RSCU values were always higher than 1.
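For reference, the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so that values above 1 indicate preferred codons. The following Python sketch illustrates the calculation under the standard codon assignments; it is not the MEGA implementation, and the function name and toy coding sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard codon assignments, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds):
    """Return {codon: RSCU} for every amino acid observed in the coding sequence."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(CODON_TABLE.values()):  # stop codons form one family ('*')
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not observed
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Hypothetical toy CDS: Lys encoded three times by AAA and once by AAG,
# Phe encoded once each by TTT and TTC.
toy_cds = "AAAAAAAAAAAG" + "TTTTTC"
result = rscu(toy_cds)
print(result["AAA"], result["AAG"], result["TTT"], result["TTC"])
```

In a real analysis the protein-coding sequences would first be concatenated, or processed gene by gene, before computing RSCU.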

Table 6 Codon usage and codon-anticodon recognition patterns of the Capsicum annuum cp. genome

IR region expansion and contraction in the Capsicum cp. genome

The IR boundaries of the cp. genomes of nine Capsicum species (C. annuum, C. chacoense, C. chinense, C. eximium, C. frutescens, C. galapagoense, C. lycianthoides, C. pubescens, C. tovarii) were then compared. Of these species, C. pubescens has the longest complete cp. genome (157,390 bp), and C. lycianthoides has the shortest (155,583 bp). All nine Capsicum species studied had a cp. genome structure typical of the angiosperms: quadripartite, with large (LSC) and small (SSC) single-copy regions separated by two inverted repeat (IR) regions (Fig. 4). The regions spanning the IR/LSC and IR/SSC junctions were compared in our nine representative Capsicum species. We found that the IR regions varied in length between 25,624 bp in C. lycianthoides and 25,887 bp in C. pubescens. Similarly, the LSC ranged from 86,813 bp in C. lycianthoides to 87,688 bp in C. pubescens, and the SSC also varied in length among species. Thus, variation in the size of Capsicum cp. genomes appears to occur as a result of variation in the lengths of the IR, SSC and LSC regions, rather than solely as a result of IR size variation, as is usual in most species.

Fig. 4

Comparison of border distance between adjacent genes and junctions of the LSC, SSC and two IR regions among the chloroplast genomes of seven Capsicum species. Boxes above or below the main line indicate the adjacent border genes. The figure is not to scale with respect to sequence length, and only shows relative changes at or near the IR/SC borders

Our nine study species varied slightly at the IR/LSC and IR/SSC junctions. The genes rps19, rpl2, ycf1 and trnH were found at either the IR/LSC or the IR/SSC boundary. The IRb-LSC boundary was similar in all the tested Capsicum species, as was the IRa-SSC boundary, and these were located in the genes rps19 and ycf1, respectively. The IRa-LSC boundary fell between rpl2 and trnH (Fig. 4). The IRb-SSC boundary was a little different. In seven of our study species, this boundary fell in the ycf1 gene; however, that of C. lycianthoides was located in the intergenic region between the ycf1 and ndhF genes. Our results therefore suggest that in Capsicum species, the IR/LSC and IR/SSC junction regions are highly conserved.

mVISTA comparison of chloroplast genomes in Capsicum

We used mVISTA to construct multiple alignments of the nine study Capsicum cp. genomes. C. lycianthoides was used as a reference genome (Fig. 5). We found that overall, the cp. genomes in Capsicum species were highly conserved. As expected, the coding regions were found to be less divergent than non-coding regions, and the IR regions were more highly conserved than either the LSC or SSC region. The intron-containing genes were highly variable in the Capsicum cp. genomes studied, and intergenic spacers (trnL-trnF, rps12-rpl20, rpl32-ndhF, trnV-rps7, rps16-trnQ, petA-psbL, and trnK-rps16) were the most highly divergent sequences in the nine chosen Capsicum cp. genomes. The trnK, rpl20, ycf1 and ycf2 sequences appear to evolve rapidly in Capsicum, as these were the coding regions with the highest divergence, and are therefore potentially of use as markers for the taxonomic classification and phylogenetic reconstruction of Capsicum species.

Fig. 5

Comparison of four cp. genomes using the mVISTA alignment program. The x-axis represents the coordinates in the cp. genome. The y-axis indicates the average percent identity of sequence similarity in the aligned regions, ranging between 50% and 100%. Purple bars represent exons, blue bars represent untranslated regions (UTRs), pink bars represent conserved noncoding sequences (CNS), gray bars represent mRNA, and white bars represent differences between genomes

To assess how well these regions discriminate among Capsicum species, we extracted the divergent rpl20 and trnK gene sequences from the cp. genomes of the nine species, aligned them into data matrices, and performed a haplotype analysis in DNAsp. The rpl20 gene yielded 7 genotypes among the nine species (Fig. 6A): C. frutescens and C. tovarii shared one genotype, as did C. chinense and C. eximium, giving a discriminatory efficiency of 77.8% (7 of 9). The trnK gene yielded 6 genotypes (Fig. 6B), giving a discriminatory efficiency of 66.7% (6 of 9); C. frutescens and C. tovarii shared the same genotype, as did C. eximium, C. chinense and C. galapagoense. The discriminatory efficiency of these cp. genome regions is therefore high.
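The reported efficiencies are consistent with dividing the number of distinct genotypes by the number of species compared (7/9 and 6/9). A minimal Python sketch of that calculation, assuming this definition, is given below; the function name, species labels and toy sequences are hypothetical and are not the rpl20 or trnK data.

```python
def discriminatory_efficiency(sequences_by_species):
    """Return (efficiency in %, groups of species sharing identical sequences).

    Efficiency is taken here as distinct genotypes / number of species,
    an assumption inferred from the values reported in the text.
    """
    haplotypes = {}
    for species, seq in sequences_by_species.items():
        haplotypes.setdefault(seq.upper(), []).append(species)
    efficiency = 100.0 * len(haplotypes) / len(sequences_by_species)
    return efficiency, list(haplotypes.values())


# Nine hypothetical species, of which two pairs share a sequence (7 genotypes).
toy = {
    "sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGC",
    "sp4": "ACGG", "sp5": "ACTT", "sp6": "ACTT",
    "sp7": "AGGT", "sp8": "AGGT", "sp9": "TCGT",
}
efficiency, groups = discriminatory_efficiency(toy)
shared = [g for g in groups if len(g) > 1]
print(round(efficiency, 1), shared)
```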

Fig. 6

Variable sites in the different species. (A) rpl20 gene; (B) trnK gene

Levels of sequence divergence and nucleotide variability (π) were then examined using DNAsp in our nine aligned Capsicum cp. genome sequences. Interestingly, we found that despite the relatively close relationships between our study species, the genomes were nevertheless divergent, with a nucleotide variability (π) of 0.0025. We found a total of 1,408 mutations, of which 1,216 were SNPs and 184 were parsimony-informative. Most of this variation occurred in the LSC region, while the IR regions were relatively conserved. The sliding window analysis (Fig. 7) showed peaks of diversity in intergenic regions such as trnL-trnF, rps12-rpl20, trnV-rps7, rpl32-ndhF and rps7-rrn16S, which were highly variable; this result is consistent with the mVISTA analysis. The cp. genome is therefore potentially informative for Capsicum phylogenetic reconstruction at the species level, with the LSC being useful in both phylogenetic and genetic diversity analyses.

Fig. 7

Sliding window analysis of the complete chloroplast (cp.) genomes from nine different Capsicum species. The x-axis represents the midpoint of the window and the y-axis represents the nucleotide diversity (Pi) of each window. The window length is 100 bp with a 25-bp step size

Phylogenetic analysis of Solanaceae species to reveal the phylogenetic relationship of Capsicum species

We constructed an alignment of cp. genome sequences, including those of the three species sequenced in this study as well as 40 further species in the Solanaceae downloaded from GenBank (Fig. 8). The best nucleotide substitution model was estimated by MEGA to be GTR + G + I, and the whole cp. genomes were used in the reconstruction of the ML trees. The results of this analysis were consistent with previous phylogenetic reconstructions in this group, and also with the traditional classification of the Solanaceae. The different genera, including Lycium, Physalis, Nicotiana, Petunia and others, could be distinguished, and the Capsicum species clustered closely together in a monophyletic group. We found that C. annuum and C. tovarii were sister groups with a bootstrap value of 100%, and that C. chinense, C. eximium and C. frutescens also grouped together with a bootstrap value of 87%. The whole cp. genome is therefore appropriate for the phylogenetic reconstruction of evolutionary relationships within the Solanaceae.

Fig. 8

Reconstructed maximum likelihood (ML) phylogenetic tree based on the chloroplast genome sequences of different species of Solanaceae. Helianthus annuus (Asteraceae) was used as an outgroup. Numbers to the right of nodes are bootstrap support values

Discussion

Characteristics of the Capsicum chloroplast genome

Chloroplast genomes can be useful in the investigation of evolutionary relationships in and among plant species [54]. The cp. genomes of certain Capsicum species have been previously reported [35,36,37,38]; however, despite the global importance of Capsicum as a crop plant, systematic research and in-depth analyses of the evolutionary relationships in Capsicum are lacking. We sequenced the cp. genomes of 32 varieties of three Capsicum species (C. annuum, C. chinense and C. frutescens) from six countries. In all three species studied, the cp. genome showed a conserved quadripartite structure ranging from 156,729 to 156,950 bp, which is similar to the cp. genomes of most terrestrial plants [55]. A total of 132 genes were found in all three species, which is consistent with results from other studies, including those on C. annuum var. annuum [33], Capsicum annuum var. glabriusculum [56], C. frutescens [35], C. chinense [36], C. baccatum [37] and C. eximium [38]. Our functional analysis of the Capsicum cp. genome also gave results similar to those reported for other species in this genus [38]. We were able to divide the genes present into three major functional categories, including genes encoding components in the photosynthetic system, the genetic system, and open reading frame and other genes. Analyses of the IR region demonstrated relatively high levels of conservation in the IR/LSC and IR/SSC junction regions between different Capsicum species. We found that the Capsicum cp. genome was AT-rich (63%), which is consistent with previous reports from Capsicum [35, 56], and indeed from most higher plants [57]. This might also explain the fact that, in our analysis of codon preference, most codons ending in A or T at the 3’ position had an RSCU > 1, and that these codons were preferred.

Genetic variation and genotype in Capsicum varieties

Capsicum is one of the most important spice crop genera worldwide. The genus is thought to have originated and been domesticated in Mexico, and to have secondary centers of origin in Guatemala and Bulgaria [58]. Columbus is credited with bringing Capsicum crops to Europe in the 15th century, from where they spread to Africa and Asia, including India, China and Japan, along the spice routes [59]. These crops therefore have a long history of cultivation in different areas of the world, and as a consequence, there is significant genetic variation in the different areas [60]. Knowledge of the germplasms available is necessary for the scientific breeding of new varieties with improved resistances against disease and adverse conditions. Landraces and cultivars of crops from different areas offer diverse germplasm resources that can be exploited in these attempts [61]. However, the landraces and cultivars that have been bred in Capsicum species have not, to date, been scientifically applied to crop improvement, and require further and extensive genetic diversity studies.

In this study, the cp. genomes of 32 varieties from three Capsicum species were sequenced, and the genetic diversity in this genus was investigated. We found only 13 cp. genotypes in the 32 varieties sampled, suggesting that the cp. genome is conserved among Capsicum species, and that the diverse phenotypes observed in different varieties may be controlled by the nuclear genome. However, the nucleotide variability and the sliding window analysis indicated that the cp. genomes were nevertheless somewhat divergent despite the relatedness of the study species, with most variation occurring in the LSC region and in non-coding sequences, and with intron-containing genes showing higher levels of variability. This pattern is similar to that reported for the cp. genomes of other species in previous studies [62, 63].

DNA barcoding has been used to identify plant species in diverse samples [64]. Many studies have considered the choice of loci for DNA barcoding, and many single-copy loci and combinations of loci have been proposed for use as DNA barcodes [65]. In the cp. genome, the loci trnK, trnH-psbA, matK and rbcL are widely recognized as barcodes for species identification [66]. However, in our study, the rpl20 gene appeared to have the highest discriminatory efficiency for Capsicum identification.

Comparison and phylogenetic analyses of the chloroplast genomes of Capsicum species provide comprehensive insights into the genetic relationships of Capsicum

About 3,000 species have been described from the Solanaceae to date, and the family contains several species of economically important plants, such as tomato, eggplant, potato and tobacco [67]. In our study of the phylogenetic relationships within the genus Capsicum and between Capsicum and other species in the Solanaceae, we reconstructed a phylogenetic tree using the cp. genomes of three species of Capsicum, and 40 further species from the Solanaceae. Capsicum formed a monophyletic group with high statistical support. Liu et al. [60] investigated the domestication and population differentiation in peppers, and also reconstructed the phylogenetic relationships within this genus using resequencing data. In the study of Liu et al. [60], C. pubescens was found to be sister to the other six species, with C. baccatum var. baccatum, C. baccatum var. pendulum and C. chacoense forming one clade (the baccatum clade), and with a second clade (the annuum clade) comprising C. annuum var. annuum, C. annuum var. glabriusculum, C. chinense, C. frutescens and C. galapagoense. Although the species that we used in our study differed slightly from these, we found that the phylogenetic relationships were similar, and also demonstrated the close relationship of the species in the pungent group (C. chinense and C. frutescens). Our phylogeny also revealed the positions of C. chacoense, C. tovarii, and C. eximium for the first time. However, within the Solanaceae, we also found certain conflicts between the cp. genome tree generated in our ML analysis and the classical classification system [68,69,70]; for example, neither Dunalia nor Solanum formed a monophyletic group in our analysis. Further investigation is necessary to determine whether some groups should be renamed.

The structure of the cp. genome is conserved, although enough sites show variation that the cp. genome is nevertheless informative in analyses of the phylogenetic relationships within the genus Capsicum. However, phylogenies constructed using cp. genomes may have limitations following extensive interspecific or intergeneric hybridization of the study species, or because of introgression or incomplete lineage sorting [71]. The inclusion of further Capsicum species and varieties in the analyses, together with a study of the morphology, biogeography and history of domestication of this group, should result in a more robust phylogeny. It is our belief that this analysis of cp. genomes in Capsicum will provide a theoretical background for further research in this important genus. Conducting larger-scale comparative analyses with more Capsicum varieties is also a possible direction for future research.

Conclusion

The cp. genomes of 32 Capsicum varieties were newly sequenced, and the genetic diversity and relationships of these varieties were revealed in this study. The results showed that the structure of the cp. genomes of the Capsicum varieties was relatively conserved. The 32 varieties yielded 13 genotypes. A total of 608 indels, 83 SNPs and 47 SSRs were identified, which can be used as molecular markers in future Capsicum diversity studies, as can highly variable regions such as the rpl20 and trnK genes. The phylogenetic reconstruction based on the cp. genomes of the Solanaceae generally reflected the currently accepted classification, with the species of the pungent group being closely related to one another. Our results enrich the data on the cp. genomes of this important vegetable genus and will play an important role in the molecular identification and phylogenetic reconstruction of Capsicum species. The genetic diversity between varieties may guide future research on the adaptive evolution of Capsicum species, and the chloroplast genome data generated in this study can be used to improve Capsicum breeding programs and develop new varieties with enhanced traits.