Introduction

Early life history stages of the majority of marine fishes occur in the upper water column, where eggs and larvae are part of the plankton before they undergo multiple metamorphoses and settle in their respective adult habitats. Large-scale surveys of the ichthyoplankton are fundamental to understanding ecosystem functioning and fish population dynamics [1, 2], and help with the sustainable management of fisheries and the design of reserve networks to preserve ocean biodiversity [3, 4]. Ichthyoplankton survey data can be used for monitoring spawning habitats [5], detecting changes in phenology related to human-induced perturbations [6,7,8,9], and estimating the biomass of spawning adults [10]. The greatest challenge when collating larval fish data is the difference in taxonomic resolution among studies (i.e. species vs genus vs family level identification), which limits the ability to make large-scale comparisons of ichthyoplankton assemblages. Identifying early life stages of fishes is difficult because few ichthyologists are able to identify larvae to the species level, and the progressive loss of taxonomic expertise (i.e. the taxonomic impediment) means that this capacity is likely to decline further in the near future [11].

Empirical studies of larval ecology face a major challenge: accurately identifying larvae and eggs to the species level using morphological characters. This task is particularly difficult because of the high diversity of species usually encountered in marine ichthyoplankton swarms and the dramatic phenotypic changes that occur during the fish life cycle [12]. Consequently, there is an urgent need for a reliable and efficient approach to fish larvae identification. Standardized molecular approaches to species identification, such as DNA barcoding, can greatly improve the identification of ichthyoplankton [13] and potentially reduce the reliance on taxonomic experts [14]. For instance, DNA barcoding has been reported to enhance the accuracy of species-level identification of marine ichthyoplankton by 70% [15]. By design, its accuracy is highly dependent on the completeness of the DNA barcode reference libraries used to assign unknown larval fish and eggs to known species [13]. These libraries turn surveys of Molecular Operational Taxonomic Units (MOTUs) into species surveys through the assignment of species names to MOTUs [16], hence giving meaning to molecular data for ecologists, evolutionary biologists and stakeholders [17]. Numerous studies have demonstrated the advantages of integrating morphological and molecular approaches in the ichthyological exploration of the world's oceans. In the Pacific Ocean, for instance, the Moorea Biocode project established the first comprehensive DNA barcode reference library of Pacific reef fishes [18], which was subsequently applied to the identification of marine larval fish assemblages in tropical and subtropical Pacific waters [13, 14].

The South China Sea (SCS) constitutes a major marine biodiversity hotspot with over 3300 fish species reported [19, 20], which are threatened by illegal fishing, disappearance of mangroves and wastewater emissions. Besides, the degradation of coral reefs in the SCS over the past few decades has posed a serious threat to the persistence of multiple fish populations. Hence, protecting fish resources in the SCS is an urgent imperative. Larval fishes and fish eggs, despite offering limited dispersal opportunities for many sedentary marine fish species, provide vital information on reproductive biology, including spawning ground and timing, and population recruitment success rates. Nonetheless, research on larval fishes and eggs remains significantly limited in the SCS, mostly scattered in the literature and plagued by misidentifications [21,22,23]. This situation is likely explained by the lack of a dedicated database to provide reliable references on larval fishes and eggs in the SCS.

Here, we present the results of a large-scale effort to DNA barcode larval fishes and fish eggs of the SCS, with the objective of providing open resources for automated identification of larval fishes and eggs and promoting studies of early life stages among marine fishes of the SCS (Fig. 1). A total of 63 sites were inventoried across the SCS between 2022 and 2023 (Fig. 2). In total, 741 specimens were identified, preserved, photographed and DNA barcoded to build an atlas guide representing at least 12 orders, 60 families, 113 genera and 188 species (Fig. 3), including 113 species of larval fish (Fig. 3A1-A2) and 85 species of fish eggs (Fig. 3B1-B2). By combining data from the current sampling effort (188 species) with DNA barcodes previously published in the literature [22,23,24,25] and retrieved from the NCBI and BOLD databases (166 species), we have constructed the first comprehensive larval fish and fish egg DNA barcode reference library for the SCS. Our library includes 1255 sequences for 20 orders, 80 families, 193 genera and 308 species, including 471 DNA barcodes for 144 species that constitute new records (Fig. 4). This library is intended for a diversity of users with varying interests, from fundamental to applied science, including fisheries management, functional ecology, taxonomy, and conservation. Additionally, the library reveals numerous newly detected taxa for scientific exploration, along with complete collection data and DNA barcodes that should facilitate their formal description as new species. Beyond shedding new light on the fish species diversity of the SCS, this publicly available resource is anticipated to catalyse the development of DNA barcode reference libraries in the SCS. Furthermore, it is expected to enhance the accuracy of results for the growing number of studies utilizing DNA barcoding in the Western Pacific.

Fig. 1

Overview of data generation. From specimen collection to the validation of data generation

Fig. 2

Map of sampling localities. Circles denote locations where zooplankton nets were used for capture, while triangles represent sampling sites with published sequences

Fig. 3

Photographs of larval fish and fish eggs. A larval fish; B fish eggs

Fig. 4

Species diversity included in the Biodiversity of South China Sea dataset. A Number of species by family; B Number of newly sequenced species in each family in the present dataset: new species records for any mitochondrial gene (green bars) and new species records for the COI marker in the South China Sea (grey bars)

Materials and methods

Specimen collection

The collection of larval fish and fish eggs was conducted in accordance with the “Marine Survey Code” (GB12763.1-7-91), utilizing a large zooplankton net equipped with a mechanical flowmeter (HYDRO-BIOS). All the ichthyoplankton samples examined and analyzed here were newly collected during the course of the present study using zooplankton nets (80 cm diameter, 270 cm long, 505 μm mesh size, with a cod-end container mesh of 400 μm) deployed in vertical and horizontal tows in open seas. Two tows, one vertical and one horizontal at fixed depth (15 min at 1.5–2.2 knots), were performed simultaneously, and the flowmeter attached to the mouth of the net was used to standardise ichthyoplankton counts to the volume of water sampled. Following the Specifications for oceanographic survey, Part 6: Marine biological survey (GB/T 12763.6-2007), the collected samples were preserved separately in 75% ethanol–seawater solution and in 5% formalin in seawater. The preserved samples were transported back to the laboratory for further analysis.
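The standardisation of counts to the volume of water sampled can be sketched as follows. This is a minimal illustration under stated assumptions rather than the survey's actual procedure: the only value taken from the text is the 80 cm net mouth diameter, while the flowmeter calibration constant (metres per revolution), the revolution count and the specimen count are hypothetical.

```python
import math

NET_DIAMETER_M = 0.80  # net mouth diameter reported in the text (80 cm)


def volume_filtered_m3(revolutions, metres_per_revolution, diameter_m=NET_DIAMETER_M):
    """Volume filtered = net mouth area x distance travelled through the water.

    `metres_per_revolution` is a hypothetical calibration constant; the real
    value depends on the flowmeter model and its calibration.
    """
    mouth_area = math.pi * (diameter_m / 2.0) ** 2
    distance = revolutions * metres_per_revolution
    return mouth_area * distance


def density_per_100_m3(count, volume_m3):
    """Standardise a raw count to individuals per 100 cubic metres."""
    if volume_m3 <= 0:
        raise ValueError("volume must be positive")
    return 100.0 * count / volume_m3


if __name__ == "__main__":
    # Hypothetical tow: 2500 revolutions, 0.3 m per revolution, 42 larvae.
    volume = volume_filtered_m3(revolutions=2500, metres_per_revolution=0.3)
    print(round(volume, 1), round(density_per_100_m3(42, volume), 2))
```

Expressing densities per 100 m³ is one common convention; any other reference volume only changes the multiplier.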

DNA barcode sequencing and data mining

In the laboratory, larval fishes and fish eggs from each station were examined using an Olympus SZX7 stereomicroscope, sorted, enumerated, and identified. Species were identified using key reference guides [26,27,28,29]. Larval fishes and fish eggs were then stored in ethanol for later reference, and a subset has been archived at the Guangdong Ocean University.

Each sample was first numbered, rehydrated in ultrapure water for 5–8 min for cleaning, and then photographed using a Zeiss microscope (Stemi 508) (Fig. 3). Total genomic DNA was extracted using an Aagen DNA Extraction Kit (Aagen, Guangdong, China) according to the manufacturer's specifications. A partial fragment of ∼650 bp from the 5′-end of the mitochondrial cytochrome c oxidase I gene (COI) was amplified with the universal primers FishF1 (TCAACCAACCACAAAGACATTGGCAC) and FishR1 (TAGACTTCTGGGTGGCCAAAGAATCA) [30]. The polymerase chain reaction (PCR) employed the following thermocycling regime: 92 °C for 5 min; 35 cycles of 92 °C for 45 s, 49 °C for 60 s and 72 °C for 60 s; and a final extension at 72 °C for 10 min. PCR amplifications were performed in a final volume of 25 μL containing 6.5 μL of ultrapure water, 2 μL of MgCl2 (5 mM), 1 μL each of the forward and reverse primers, 12.5 μL of Taq polymerase and 2 μL of genomic DNA. Sanger sequencing was conducted at the Sangon Biotech company. Sequences were edited using Sequencher 5.4.5 (Gene Codes), aligned with MAFFT [31] and submitted to GenBank (accession numbers: PP354153-PP354861 and PP354116-PP354147; Table S1).

The taxonomic coverage of the DNA barcode reference library was further expanded by mining COI sequences for missing taxa in international repositories such as GenBank and BOLD. We employed two retrieval strategies: (1) In GenBank, we searched and downloaded COI sequences using a combination of keywords including ‘South China Sea’, ‘fish larvae’, ‘fish eggs’ and ‘fish larvae and eggs’; (2) We searched for studies in Web of Science matching the aforementioned keywords to identify and review all relevant publications, and compiled all available COI sequences.

Genetic species delimitation

Several methods have been suggested for delineating species based on DNA sequences [17, 32, 33]. Each of these methods has distinct properties, especially in handling singletons (i.e., delimited lineages represented by a single sequence) or heterogeneous speciation rates among lineages [34]. A combination of different approaches is increasingly used to overcome potential pitfalls arising from uneven sampling [35,36,37,38]. We used six different sequence-based methods of species delimitation to identify Molecular Operational Taxonomic Units (MOTUs): (i) Refined Single Linkage (RESL) as implemented in BOLD and used to generate Barcode Index Numbers (BINs) [17], (ii) Assemble Species by Automatic Partitioning (ASAP) [33], (iii) the Poisson Tree Process (PTP) in its single-rate (sPTP) and multiple-rate (mPTP) versions as implemented in the stand-alone software mPTP 0.2.3 [32], and (iv) the General Mixed Yule-Coalescent (GMYC) in its single (sGMYC) and multiple (mGMYC) threshold versions as implemented in the R package Splits 1.019 [39]. The final delimitation scheme was established by deriving a majority-rule consensus from the six delimitation analyses.
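One way to derive a majority-rule consensus from several delimitation schemes is sketched below. The text does not describe how the consensus was computed, so the rule used here is an assumption: two sequences are placed in the same consensus MOTU when a strict majority of methods cluster them together, and groups are then merged transitively. The function name, input layout and toy labels are hypothetical.

```python
from itertools import combinations


def majority_rule_consensus(partitions):
    """Build consensus groups from several delimitation schemes.

    partitions: {method_name: {sequence_id: cluster_label}}
    Assumed rule: a pair of sequences is linked when more than half of the
    methods place both in the same cluster; linked pairs are merged
    transitively with a union-find structure.
    """
    methods = list(partitions)
    seq_ids = sorted(partitions[methods[0]])
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(seq_ids, 2):
        votes = sum(1 for m in methods if partitions[m][a] == partitions[m][b])
        if votes * 2 > len(methods):  # strict majority; a 3-3 tie does not link
            parent[find(a)] = find(b)

    groups = {}
    for s in seq_ids:
        groups.setdefault(find(s), set()).add(s)
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


if __name__ == "__main__":
    toy = {
        "BIN":   {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
        "ASAP":  {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
        "sPTP":  {"s1": 1, "s2": 1, "s3": 2, "s4": 3},
        "mPTP":  {"s1": 1, "s2": 1, "s3": 1, "s4": 1},  # lumps everything
        "sGMYC": {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
        "mGMYC": {"s1": 1, "s2": 2, "s3": 3, "s4": 4},  # splits everything
    }
    for group in majority_rule_consensus(toy):
        print(sorted(group))
```

In the toy data, one method lumps all sequences and another splits them all, yet the consensus still recovers the two groups supported by the remaining four schemes. Transitive merging can join sequences that were never directly co-clustered by a majority, which is a design choice of this sketch.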

Both RESL and ASAP use DNA alignments as input, with sequences being submitted to the BOLD and ASAP web servers, respectively, for delimitation analysis. For PTP analyses, a Maximum Likelihood (ML) tree was generated using the R package Phangorn 2.8.1 [40] with a GTR + F + R10 substitution model. The ultrametric and fully resolved tree required by GMYC analyses was reconstructed using the Bayesian approach implemented in BEAST 2.4.8 [41]. Two Markov chains of 50 million states each were run independently using a Yule pure-birth tree prior, a strict clock of 1.2% of genetic distance per million years [42], and a GTR + F + R10 substitution model. Trees were sampled every 10,000 states after an initial burn-in period of 10 million states. Both runs were first checked for adequate effective sample sizes (ESS > 200) using Tracer 1.7.1 and then combined using LogCombiner 2.4.8. The maximum clade credibility tree was built using TreeAnnotator 2.4.7 [41]. Sequences were collapsed into haplotypes prior to Bayesian analyses using the ALTER webserver (http://www.sing-group.org/ALTER/) [43]. Finally, we used the R package P2C2M.GMYC 1.0 (https://github.com/P2C2M) to assess the fit of the GMYC model to our dataset.
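Collapsing an alignment into haplotypes amounts to grouping identical sequences. The sketch below illustrates the idea only; it is not the ALTER implementation, and it assumes exact string identity after upper-casing, with no special handling of gaps or ambiguity codes. The function name and sequence identifiers are hypothetical.

```python
def collapse_haplotypes(sequences):
    """Group aligned sequences that are strictly identical.

    sequences: {sequence_id: aligned_sequence}
    Returns {haplotype_sequence: [sequence_ids sharing it]}.
    Assumption: identity is exact string equality after upper-casing.
    """
    haplotypes = {}
    for seq_id, seq in sequences.items():
        haplotypes.setdefault(seq.upper(), []).append(seq_id)
    return haplotypes


if __name__ == "__main__":
    toy = {"larva_01": "ACGTAC", "egg_07": "acgtac", "larva_02": "ACGTAT"}
    for haplotype, members in collapse_haplotypes(toy).items():
        print(haplotype, members)
```

Keeping the list of member identifiers for each haplotype makes it possible to map tree tips back to individual specimens after the analysis.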

The taxonomic coverage of the present sampling was further examined at both the species and MOTU levels by generating sequence accumulation curves using the R package iNEXT [44].
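The interpolated part of an accumulation curve can be illustrated with classical individual-based rarefaction. This sketch is not the iNEXT implementation, which also extrapolates beyond the observed sample size; it only computes the expected number of species in a random subsample of m sequences. The abundance vector in the example is hypothetical.

```python
from math import comb


def rarefied_richness(abundances, m):
    """Expected number of species in a random subsample of m sequences.

    abundances: number of sequences per species (or per MOTU).
    Uses the hypergeometric formula
        E[S_m] = sum_i (1 - C(N - n_i, m) / C(N, m)),
    where N is the total number of sequences and n_i the count for species i.
    """
    n_total = sum(abundances)
    if not 0 <= m <= n_total:
        raise ValueError("m must lie between 0 and the total number of sequences")
    denominator = comb(n_total, m)
    return sum(1 - comb(n_total - n, m) / denominator for n in abundances if n > 0)


if __name__ == "__main__":
    counts = [72, 10, 5, 3, 1, 1, 1]  # hypothetical sequences per species
    for m in (1, 10, 50, sum(counts)):
        print(m, round(rarefied_richness(counts, m), 2))
```

A curve that is still rising steeply at m equal to the full sample size indicates that additional sampling would be expected to recover more species.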

Specimen identification and genetic distances

Morphological identifications were refined by comparison with sequence-based identifications performed with the BLAST engines of the NCBI nucleotide database and the Barcode of Life Data System (BOLD). BLAST results were collected for both the best match and the interspecific best match following Hubert et al. [13]. A specimen was identified to the species level when the best match was above, and the interspecific best match below, a similarity threshold of 98%.
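The threshold rule can be written as a small decision function. This is a sketch under assumptions: the 98% threshold comes from the text, and the three outcome labels follow the categories named in the Discussion (match to species, ambiguous match, unmatched), but the handling of values exactly equal to the threshold and the function name are choices made here for illustration.

```python
SIMILARITY_THRESHOLD = 98.0  # percent similarity, as stated in the text


def classify_match(best_match, interspecific_best_match, threshold=SIMILARITY_THRESHOLD):
    """Classify a query from its two similarity scores (in percent).

    case I   : best match above the threshold and interspecific best match
               below it (unambiguous species-level identification)
    case II  : best match above the threshold but the interspecific best
               match is not below it (ambiguous match)
    case III : best match not above the threshold (unmatched)
    Assumption: scores exactly equal to the threshold do not count as
    'above' or 'below'.
    """
    if best_match > threshold:
        if interspecific_best_match < threshold:
            return "case I"
        return "case II"
    return "case III"


if __name__ == "__main__":
    for scores in [(99.5, 92.0), (99.5, 98.7), (95.0, 90.0)]:
        print(scores, classify_match(*scores))
```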

Kimura 2-parameter (K2P) [45] pairwise genetic distances were calculated utilizing the R package Ape 4.1 [46]. Maximum intraspecific and nearest-neighbor genetic distances were calculated using the matrix of pairwise K2P genetic distances and the R package Spider 1.5 [47]. We checked for the presence of a barcode gap, i.e. the lack of overlap between the distribution of the maximum intraspecific and nearest neighbour genetic distances, by plotting both distances and examining their relationships on an individual basis rather than comparing both distributions independently [48]. The barcode gap was examined for both species and MOTUs, and a neighbour-joining (NJ) tree, constructed based on K2P distances, was also generated for a visual inspection of genetic distances and DNA barcode clusters. We used the K2P distance in all distance-related metrics in order to account for biased transition/transversion ratios and make our study comparable with similar metrics in the literature as K2P is a widely used model to compute genetic distances in DNA barcoding studies.
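For readers unfamiliar with these metrics, the sketch below computes the K2P distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions, and then derives the maximum intraspecific and nearest-neighbour distances per species. It is a plain-Python illustration, not the Ape or Spider code used in the study; sites with gaps or ambiguity codes are skipped pairwise (an assumption), and the toy sequences and function names are hypothetical.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # pairwise deletion of gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # Raises ValueError for saturated pairs where the logarithm is undefined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def barcode_gap(sequences, species_of):
    """Return {species: (max_intraspecific, nearest_neighbour)} K2P distances."""
    summary = {}
    for species in sorted(set(species_of.values())):
        own = [s for s in sequences if species_of[s] == species]
        other = [s for s in sequences if species_of[s] != species]
        intra = [k2p_distance(sequences[a], sequences[b]) for a, b in combinations(own, 2)]
        inter = [k2p_distance(sequences[a], sequences[b]) for a in own for b in other]
        summary[species] = (max(intra, default=0.0), min(inter, default=float("inf")))
    return summary


if __name__ == "__main__":
    seqs = {"a1": "AAAAAAAAAA", "a2": "AAAAAAAAAG", "b1": "CAAAAAAAAA"}
    names = {"a1": "sp_A", "a2": "sp_A", "b1": "sp_B"}
    for species, (max_intra, nearest) in barcode_gap(seqs, names).items():
        print(species, round(max_intra, 4), round(nearest, 4), max_intra < nearest)
```

Comparing the two values species by species, rather than comparing their overall distributions, is what allows individual species lacking a barcode gap to be flagged.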

Finally, haplotype networks were reconstructed with the software Network 4.6 [49] using the statistical parsimony network approach, in order to visually examine cases of closely related species exhibiting haplotype sharing and/or mixed genealogies. The haplotype networks included newly generated sequences as well as additional sequences mined from public databases.

Results

A total of 741 COI sequences were produced from the samples collected at the 63 sites visited in the SCS between 2022 and 2023 (Table S1; Fig. 2). All sequences were longer than 600 bp and had no stop codons or insertions/deletions, indicating that they represent functional coding regions. Alongside the newly generated DNA barcodes, 514 sequences belonging to 116 species, 129 genera, 63 families and 19 orders from 185 collection sites were mined from GenBank and BOLD. After aligning newly generated and mined sequences, the final alignment consisted of 1255 sequences of 648 bp from 248 sites in the SCS.
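A stop-codon screen of this kind can be sketched as follows. The stop codons are those of the vertebrate mitochondrial genetic code (TAA, TAG, AGA, AGG). The text does not state which reading frame was used, so the sketch simply reports which of the three forward frames are free of stop codons; the function names and the toy sequence are hypothetical.

```python
# Stop codons of the vertebrate mitochondrial genetic code.
MITO_VERTEBRATE_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def stop_codon_positions(sequence, frame=0):
    """Return 0-based start positions of stop codons in one forward frame."""
    seq = sequence.upper().replace("-", "")  # assumes gaps can simply be dropped
    return [
        i for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in MITO_VERTEBRATE_STOPS
    ]


def open_frames(sequence):
    """Return the forward frames (0, 1, 2) that contain no stop codon."""
    return [frame for frame in range(3) if not stop_codon_positions(sequence, frame)]


if __name__ == "__main__":
    toy = "ATGTAAGGC"
    print(stop_codon_positions(toy, frame=0))
    print(open_frames(toy))
```

A barcode with no open forward frame would be a candidate pseudogene or sequencing error and would warrant closer inspection.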

MOTU delimitation analyses yielded varying numbers of MOTUs depending on the method, with 302, 287, 308, 197, 304 and 316 MOTUs delimited by BIN, ASAP, sPTP, mPTP, sGMYC and mGMYC, respectively. The P2C2M.GMYC analysis indicated that our dataset did not violate GMYC assumptions (p = 0.7), implying that the GMYC model is applicable to our dataset. The final consensus, based on a majority rule, consisted of 305 MOTUs (Fig. 5). BLAST analyses in NCBI and BOLD were congruent, and a total of 131 sequences, corresponding to 58 species, could be unambiguously identified to the species level using a 98% similarity threshold (Table S2; Fig. 6, case I). A total of 583 sequences could be identified to the species level using the 98% similarity threshold; however, the best interspecific match was below 98% similarity as well (Fig. 6, case II). Finally, 27 sequences could not be identified to the species level.

Fig. 5

Bayesian chronograms based on a 1.2% of genetic divergence per million years including DNA-based species delimitation derived from ASAP, BIN, sPTP, mPTP, sGMYC, mGMYC and final delimitation schemes based on majority rule consensus among the six methods for larval fish and fish eggs

Fig. 6

Identifications using the BOLD database; best match compared with nearest neighbour (similarity percentage) for each specimen in BOLD

When combining morphological and molecular identifications, a total of 188 species belonging to 113 genera, 60 families and 12 orders were identified among the 741 newly generated DNA barcodes (Table S1, Fig. 4). Among these 188 species, 121 were represented by larval fish samples and 99 species by fish egg samples. Together with the 514 DNA barcodes mined from GenBank and BOLD, the 1255 DNA barcodes collected here belong to 308 species, 193 genera, 80 families and 20 orders (Table 1). The number of sequences per species varied from 1 to 72, with an average of 4 sequences per species. Mean genetic divergence was 17.68% (0%–34.03%) within family and 8.50% (0%–25.57%) between genera within family (Table 2). Maximum intraspecific distances ranged between 0% and 21.40%, and nearest-neighbour distances ranged from 0% to 23.55% (Table 2). Nearest-neighbour distance was 36.71-fold higher than maximum intraspecific distance on average, with an index ratio ranging between 0 and 122.94.

Table 1 Taxonomic coverage across orders, families, genera and species of the present study
Table 2 Summary statistics of pairwise K2P genetic distances among sequences at the species, genus, and family level

Plotting maximum intraspecific and nearest-neighbour K2P genetic distances revealed the absence of a barcode gap at the species level, as maximum intraspecific genetic distances exceeded distances to the nearest neighbour in several cases (Fig. 7). However, a barcode gap was observed for MOTUs (Fig. 7A, B, D). A total of 9 species displayed a lower nearest-neighbour K2P distance than their maximum intraspecific distance (Thunnus albacares, Ostorhinchus kiensis, Hypoatherina valenciennei, Arnoglossus polyspilus, Coryphaena hippurus, Sardinella jussieu, Thryssa kammalensis, Nectamia fusca, Rhabdamia gracilis), and nearest-neighbour K2P distances below 1% were observed for 27 species, with 1 species displaying a K2P genetic distance of 0 to its nearest phylogenetic relative (Thryssa hamiltonii). In addition, 11 species displayed maximum intraspecific distances above 2%, including Ostorhinchus fasciatus (3.22%), Ceratoscopelus warmingii (3.97%), Scatophagus argus (6.68%), Terapon jarbua (7.59%), Platycephalus indicus (8.10%), Engyprosopon latifrons (8.42%), Cynoglossus macrolepidotus (14.33%), Eleotris oxycephala (19.03%), Thryssa kammalensis (20.65%), Nectamia fusca (20.90%) and Rhabdamia gracilis (21.40%) (Table S3).

Fig. 7

Distribution of K2P genetic distances. A distribution of the maximum K2P genetic distances within MOTUs; B distribution of the minimum K2P genetic distances to the nearest MOTU. The dashed line highlights the ‘barcoding gap’ between the distributions of maximum intra-MOTU and minimum inter-MOTU distances; C relationships between the maximum intraspecific and nearest-neighbour (NN) K2P genetic distances; D relationships between the maximum intra-MOTU and nearest-neighbour (NN) K2P genetic distances for MOTUs

Visual examination of the NJ tree constructed using K2P genetic distances revealed several discrepancies between species and MOTUs, with 14 species displaying multiple MOTUs (Table 3) and 23 species displaying mixed genealogies within 11 MOTUs shared by more than one species (Table 4; Fig. 5). Accumulation curves indicate that neither the newly generated set nor the entire dataset reaches a plateau for larval fish and fish eggs, suggesting that the number of species recovered in this study underestimates the true diversity of early life stages in the South China Sea (Fig. 8).

Table 3 List of potential cryptic species including their barcode index number in BOLD, maximum K2P genetic distances within a MOTU (DM), and K2P distance to their nearest MOTU (DN). The nearest-neighbour distance for each MOTU corresponds to the distance to the closest MOTU in this study
Table 4 List of potential MOTUs shared by multiple species including their maximum K2P genetic distances within a MOTU (DM), K2P distance to their nearest MOTU (DN), and a list of species detected for each MOTU. The nearest-neighbour distance for each MOTU corresponds to the distance to the closest MOTU, disregarding species boundaries
Fig. 8

Accumulation curves for species richness in the present sampling. A accumulation curve for larval fish and eggs in the newly generated records; B accumulation curve for fish early life stages in the entire dataset. Solid and dotted lines represent current and extrapolated trends, respectively

Haplotype networks were reconstructed for 23 species displaying shallow genetic divergence and/or haplotype sharing. Scattered haplotypes across haplotype networks were observed for the MOTU including Alepes djedaba, Alepes kleinii and Selaroides leptolepis (MOTU11), the MOTU including Hirundichthys affinis and Hirundichthys oxycephalus (MOTU118), the MOTU including Prognichthys brevipinnis and Prognichthys sealei (MOTU230), the MOTU including Sardinella gibbosa and Sardinella jussieu (MOTU245), the MOTU including Saurida tumbil and Saurida undosquamis (MOTU249), and the MOTU including Thunnus albacares and Thunnus tonggol (MOTU287). A single case of haplotype sharing was observed between Thryssa hamiltonii and Thryssa kammalensis (MOTU284), with the shared haplotype placed in a central position in the haplotype network (Fig. 9).

Fig. 9

Median-joining networks for species groups with mixed genealogies and/or sharing haplotypes for COI sequences

The maximum clade credibility tree reconstructed with BEAST 2.4.8 for the 14 species displaying multiple and deeply diverging MOTUs revealed a diversity of phylogeographic patterns (Fig. 10). Two lineages were detected in Ceratoscopelus warmingii, Coryphaena equiselis, Cynoglossus macrolepidotus, Eleotris oxycephala, Engyprosopon latifrons, Eviota shimadai, Nectamia fusca, Ostorhinchus fasciatus, Platycephalus indicus, Rhabdamia gracilis, Scatophagus argus, Terapon jarbua, Thryssa kammalensis and Upeneus japonicus. In most of the cases where two MOTUs were detected within a species, a South–North or an East–West differentiation was observed, as exemplified by S. argus, E. oxycephala and T. kammalensis (South–North) (Fig. 10E, J, N), as well as E. latifrons, C. equiselis and R. gracilis (East–West) (Fig. 10B, C, D). In E. latifrons and C. equiselis, one lineage was distributed in the Northwest SCS, including the Beibu Gulf and the waters near Hainan Island, while the other occurred largely around the Zhongsha Islands or near Hainan Island (Fig. 10B, C). In R. gracilis, the two lineages were located in the Beibu Gulf and the Pearl River estuary, respectively (Fig. 10D). The two lineages of S. argus, E. oxycephala and T. kammalensis exhibited distinct geographic isolation along the Sunda Shelf (Fig. 10E, J, N). However, haplotypes from distinct MOTUs co-occurred in some regions of the Beibu Gulf and near Hainan Island. Alternatively, allopatric distributions of conspecific MOTUs involved different patterns: (1) the Beibu Gulf vs. the Pearl River and Paracel Islands for the MOTUs within R. gracilis and E. latifrons (Fig. 10D, B), and (2) Malaysia vs. the Beibu Gulf and Pearl River for S. argus, E. oxycephala and T. kammalensis (Fig. 10E, J, N). The estimated divergence times suggest that most MOTU divergence events originated during the Pliocene, with a few notable exceptions within C. macrolepidotus, E. latifrons, P. indicus and U. japonicus, in which MOTU divergence occurred before the Pliocene (Fig. 10).

Fig. 10

Phylogeographic patterns among selected groups of species with multiple MOTUs. MOTUs are represented according to the final delimitation scheme based on the majority-rule consensus among the six methods. Different colours represent different species, from top to bottom: C. macrolepidotus, E. latifrons, C. equiselis, S. argus, T. kammalensis, R. gracilis, N. fusca, O. fasciatus, C. warmingii, P. indicus, E. oxycephala, E. shimadai, U. japonicus and T. jarbua. On the right side, the geographical patterns correspond to the colours of the respective species. Trees at the bottom right of each map are neighbour-joining trees of the corresponding species using K2P distances. Different colours and symbols (circle vs triangle) represent different lineages and/or MOTUs. Scale bars correspond to K2P genetic distances (the geographical location of one of the lineages has been circled)

Discussion

The present study provides the first comprehensive assessment of DNA barcoding for the identification of larval fish and fish egg assemblages of the SCS, including individual photographs of the 113 species of larval fish and the 85 species of fish eggs sampled here, as well as a DNA barcode reference library for 308 species. This study also compiles published larval fish and fish egg COI sequences from the SCS [22,23,24,25], providing the largest DNA barcode reference library published so far for SCS fish early stages. Besides, major regions of the SCS were covered, including the Beibu Gulf, the waters near Hainan Island, the Qiongzhou Strait, the Guangdong coast, the Pearl River, the Zhongsha Islands and Malaysia, which allows examining the impact of geographic structure on the performance of DNA barcoding.

Accuracy of DNA barcoding for the SCS larval fish and fish eggs

Since the inception of DNA barcoding by Hebert et al. [50], it has been increasingly used as a standardized molecular method of species identification, and numerous studies have demonstrated how DNA barcoding can help accelerate the pace of species discovery [13, 35,36,37,38, 51]. Our study confirms the benefits of integrating DNA barcoding into the taxonomic workflow of a biodiversity inventory in species-rich, yet complex biotas. In the case of complex species assemblages where identifications are challenging and taxonomic controversies are present, a combination with detailed morphological comparisons is necessary [52]. Besides, the accuracy of a DNA barcode reference library is also tightly dependent on the accuracy of morphological identifications [16]. Here, we ensured the accuracy of our library by including four steps in the identification procedure. First, larval fish and eggs were sorted according to their morphological attributes using the most up-to-date field guides available. Second, previously published DNA barcodes for SCS fish early stages were mined from BOLD and NCBI. Third, DNA barcodes of individual larval fishes and eggs were blasted in NCBI and BOLD to obtain a molecular identification following the methodology previously proposed by Hubert et al. [13]. BLAST results were sorted according to a similarity threshold of 98%, and three categories were determined: match to species (case I), ambiguous match to species (case II) and unmatched (case III) [13, 22]. Cases of ambiguous match were further examined by comparison with sequences mined from GenBank and BOLD, and the origin of the ambiguity was determined, i.e. species synonymy, shallow divergence or lineage sorting. Fourth, BLAST results were compared with morphological identifications, and morphological attributes were re-examined in the light of DNA barcoding. Here, unambiguous match to species accounted for only 63.29% of the samples, but our iterative procedure helped in improving overall identification results.

Species divergence in the SCS

This study provides molecular evidence for the presence of 305 MOTUs whose delimitation was corroborated by most of the DNA-based delimitation methods applied. Several instances of large conflicts between mPTP and the other algorithms were associated with cases of multiple MOTUs separated by small genetic distances (e.g. Abudefduf septemfasciatus and A. vaigiensis MOTUs). This is a known tendency of mPTP, which tends to underestimate diversity if diversification rates vary widely among the lineages analyzed [35, 37]. These findings affirm the advantages of integrating multiple species delimitation methods and opting for a consensus approach instead of relying on a single method to avoid artifacts [35, 36, 38, 52]. These methods helped characterize the diversity among larval fish and eggs of the SCS, and highlighted multiple cases of shallow divergence among closely related species or multiple cryptic MOTUs within species related to the geographic structure of the SCS.

The accuracy of DNA barcoding in identifying unknown specimens to the species level is tightly linked to the taxonomic coverage of the fauna under scrutiny and the spatial coverage of genetic diversity for widespread species [35]. Spatial scale is particularly important for the detection of a barcode gap, as increasing the spatial scale may increase maximum intraspecific genetic distances and, by increasing taxonomic coverage, decrease the distance to the nearest neighbour [36]. Comparisons with other large-scale DNA barcoding campaigns are consistent with this expectation, as the average intraspecific genetic distance estimated here (0.13%) is lower than that observed elsewhere at wider spatial scales, such as the Mediterranean (0.39%), Australian shores (0.39%) and the Indo-Pacific (> 1%). However, with an average genetic distance among congeneric species of 8.5%, the SCS exhibits shallower divergence among congeneric species than the Mediterranean (8.91%), Australian waters (9.93%), or the Indo-Pacific (> 14%) [18, 53, 54]. This trend suggests that the SCS hosts more closely related species than other DNA-barcoded regions of the Pacific. The variations observed between genera may be indicative of the average age of congeneric species divergence, as some species are younger than others within genera, and also in comparison to other genera [30]. This trend suggests that the SCS served as a diversification hotspot during the Pliocene, with increased speciation rates compared to other marine fish assemblages.

Shallow divergence and haplotype sharing

DNA-based species delimitation analyses converged with specimen identifications in 285 species where a single MOTU was delimited within a nominal species. This indicates a success rate of 93.44%, which is comparable with other large-scale studies [51, 55]. In total, 305 MOTUs were detected among the 308 species, and 11 MOTUs included more than one species, involving 23 species in total. Taking into account the maternal inheritance of mitochondrial genes and the shallow genealogies observed here, the mixing of species genealogies in those 11 cases can be attributed to either recent divergence and incomplete lineage sorting or historical introgressive hybridization [36, 37]. Shallow divergence predominantly occurs among closely related species within the same genus (e.g. Alepes, Prognichthys, Saurida, Thryssa, Hirundichthys, Sardinella, Thunnus), but a few cases involving genus paraphyly were detected (Alepes and Selaroides). This shallow genetic divergence was accompanied by an apparent morphological similarity among species. This trend is consistent with the role of the SCS as a hotspot of diversification for marine fishes [56].
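The reported success rate is reproduced when the 285 concordant cases are divided by the 305 consensus MOTUs rather than by the 308 nominal species. This choice of denominator is an inference from the figures given in the text, not something the source states explicitly:

```latex
\[
\frac{285}{305} \times 100 \approx 93.44\%,
\qquad
\frac{285}{308} \times 100 \approx 92.53\%
\]
```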

A single case of haplotype sharing is observed between Thryssa hamiltonii and T. kammalensis, with a single shared haplotype in a central position of their reconstructed haplotype network, suggesting incomplete lineage sorting rather than introgressive hybridization. This shared ancestral haplotype is located in the Beibu Gulf, while more recently derived haplotypes occurring in Malaysia for both species were not shared, suggesting a common origin in the Beibu Gulf. However, this hypothesis calls for a broader study of this species pair using molecular markers of biparental inheritance.

Cryptic diversity and phylogeographical patterns

Multiple, highly divergent MOTUs were detected within 14 species, namely C. warmingii, C. equiselis, C. macrolepidotus, E. oxycephala, E. latifrons, E. shimadai, N. fusca, O. fasciatus, P. indicus, R. gracilis, S. argus, T. jarbua, T. kammalensis and U. japonicus, for a total of 28 MOTUs (Table 4). Most intraspecific lineage divergences are dated to the Pliocene, suggesting the influence of historical geological events (Fig. 10I). High divergence between MOTUs (> 3% maximum intraspecific genetic distance) is observed in C. warmingii, C. macrolepidotus, E. oxycephala, E. latifrons, N. fusca, O. fasciatus, P. indicus, R. gracilis, S. argus, T. jarbua and T. kammalensis. In C. macrolepidotus, O. fasciatus, N. fusca, U. japonicus, P. indicus, C. warmingii and T. jarbua, intraspecific MOTUs display alternative geographical distributions, either within the Beibu Gulf or between the Beibu Gulf and the coastal areas of eastern Guangdong and the waters near Hainan Island. By contrast, E. shimadai has two distinct MOTUs, both observed in the coral reefs of the Zhongsha Islands, which are known to host a high diversity of species [23, 57]. For E. latifrons, C. equiselis and R. gracilis, two MOTUs were identified, one in the Beibu Gulf and the other in the coastal areas along the eastern sides of Guangdong and Hainan. These distribution patterns are consistent with the influence of a terrestrial barrier whose emergence isolated the Beibu Gulf from the coastal areas along the eastern sides of Guangdong and Hainan, a scenario previously proposed in phylogeographical studies of multiple fish taxa [58].
Hainan Island started to separate and drift away from the northern part of the South China Sea (Beibu Gulf) approximately 65 million years ago (Ma) [58], a time frame consistent with our divergence time estimates among these MOTUs, dated at 32.99–41.92 Ma (C. equiselis ~32.99 Ma, E. latifrons ~41.92 Ma). Likewise, the two MOTUs detected in each of S. argus, E. oxycephala and T. kammalensis show marked geographical structure, with one MOTU located in Malaysia and the other in the northern part of the South China Sea. This pattern is consistent with the influence of the Mid-Indian Ocean Barrier (MIOB) on the divergence of MOTUs within widespread species [59, 60]. The MIOB has been identified as a strong barrier to gene flow for marine organisms, which indicates that older geographical barriers can promote divergence among clades even within a highly connected marine environment.

Conclusions

Our study provides the most comprehensive DNA barcoding campaign to date for larval fish and fish egg assemblages in the SCS, with several findings challenging current taxonomic knowledge. To ensure the accuracy of our identifications, a multi-step procedure was implemented, which substantially improved the resolution of the identifications. The present study underscores the integrity and accuracy of this reference library for the SCS and confirms the utility of standardized DNA-based species delimitation methods in supporting biodiversity inventories and species identification. Conflicts between nominal species boundaries and sequence-based species delimitation point to relatively shallow interspecific divergence among 23 species. At the other end of the spectrum, the detection of multiple, highly divergent MOTUs in 14 species suggests that the diversity of SCS fishes is currently underestimated and that potentially new species await formal description. This pattern points to the influence of several geographical barriers, fostering the emergence of cryptic diversity and unrecognized divergence events [18, 61]. We confirmed the influence of geographic structure on this diversification, a trend that further highlights the need to improve our knowledge of this biodiversity-rich region. As such, this study warrants further research and provides guidelines for future taxonomic studies, sustainable management and conservation of fishery resources in the SCS.