Long-read metabarcoding: from available tools to reference databases
Publisher
Tartu Ülikooli Kirjastus
Abstract
Traditional methods for community monitoring, such as morphology-based species identification, are often time-consuming, especially for microscopic organisms. For this reason, metabarcoding through high-throughput sequencing has become a popular, fast, and cost-effective method for characterizing diverse communities. The technologies most commonly used in metabarcoding workflows are the so-called second-generation high-throughput sequencing platforms. Although these can generate millions of highly accurate DNA fragments, the sequences produced by second-generation technologies are relatively short, which can limit discrimination between closely related species. Third-generation sequencing technologies can sequence much longer DNA stretches that span entire gene regions, thereby improving taxonomic resolution. However, the relatively new ability to produce much longer DNA fragments suitable for species identification also brings new analytical challenges: many existing bioinformatics tools were developed for analyzing short sequences, comprehensive reference databases for long sequences are lacking, and the formation of chimeric (non-biological) sequences may be more problematic when long fragments are generated for sequencing.
First, this doctoral thesis provides an overview of many existing bioinformatics workflows, offering practical guidance for choosing suitable bioinformatics tools according to the structure of the data being analyzed. Second, the EUKARYOME database was developed, the first curated reference database of long ribosomal RNA markers, covering more than 172,000 species. Third, the thesis found that existing algorithms for detecting chimeric DNA sequences misclassify many biological sequences as chimeras; that is, the false-positive rate is high with default settings. Parameter fine-tuning and secondary validation strategies, however, improved the accuracy of the analyses. Together, these results and resources advance the long-read metabarcoding workflow as a reliable tool for biodiversity assessment.
Traditional approaches for monitoring species diversity based on visual identification are time-consuming, require specialized taxonomic expertise, and often fail to identify cryptic species or juvenile life stages. DNA metabarcoding has revolutionized biodiversity assessments by enabling simultaneous identification of multiple species from mixed environmental samples using standardized genetic markers. While second-generation sequencing platforms can generate millions of DNA fragments with high accuracy, they are constrained by short read lengths that limit species-level resolution. Third-generation sequencing technologies can produce reads long enough to span complete gene operons, including the full ribosomal RNA complex, thereby enhancing taxonomic resolution and providing more information for accurate species identification. However, this shift toward long-read sequencing introduces analytical challenges: many bioinformatics tools were developed for short-read data, comprehensive reference databases for full-length sequences are lacking, and long amplicons are more susceptible to chimeric artifacts, which are artificial sequences formed during PCR amplification. This doctoral thesis addresses these challenges through three interconnected studies. First, a comprehensive review of available bioinformatics pipelines provides practical guidance for selecting appropriate tools based on sequencing platforms, data structure, and computational expertise. Second, the EUKARYOME database was developed as the first curated reference containing full-length ribosomal RNA sequences for all eukaryotes, covering over 172,000 species and enabling accurate taxonomic identification and chimera validation for long-read data. Third, evaluation of common chimera detection algorithms revealed high false-positive rates, where genuine biological sequences are incorrectly classified as artifacts. 
However, parameter tuning and secondary validation strategies can effectively reduce these errors while retaining genuine sequences. Importantly, while method-specific biases affect taxonomic composition, their impact on community-level patterns remains limited. These findings collectively advance long-read metabarcoding as a robust tool for biodiversity assessment.
Description
The electronic version of the doctoral thesis does not include the publications
Keywords
doctoral theses