ABSTRACT
INTRODUCTION: Neurocysticercosis is the most frequent parasitic disease of the human CNS. It is most prevalent in low- and middle-income countries, where poor sanitation and free-roaming pigs are common. Analyzing how the transcriptome of the host's brain tissue adjacent to the cyst and the transcriptome of the cyst itself change throughout infection would help unravel the interactions between the parasite and the host immune system and facilitate understanding of the resulting disease. However, despite the Peruvian and Mexican initiatives to sequence the T. solium genome [1], the genome is still not entirely resolved.
OBJECTIVES: We therefore set out to improve the annotation of the T. solium genome using public transcriptome data.
METHODS: We used publicly available T. solium RNA-Seq data [2] deposited in the NCBI Sequence Read Archive, following the pipeline of Ji et al., 2020 [3] for the discovery of new genetic elements.
RESULTS: HISAT2 [4] aligned the transcriptome data to the T. solium reference genome with an 89.26% alignment rate. StringTie [5] and QUAPRA [6] each assembled the aligned reads into new GTF files, which Cuffcompare [7] then compared with the GTF file of the reference genome. StringTie yielded 23,252 new mRNAs and QUAPRA 20,743; of these, Cuffcompare classified 3,216 (StringTie) and 3,334 (QUAPRA) as potentially new transcripts. For coding-ability prediction, the new transcripts with FPKM > 1 were submitted to CPAT [8]. CPAT derives a coding-score cutoff after training on coding and non-coding mRNA datasets from the target organism. To overcome the lack of non-coding gene annotation for T. solium and other cestodes, we created a C. elegans training dataset. Of the 2,134 transcripts analyzed by CPAT, 121 (StringTie) and 912 (QUAPRA) were above the coding-score cutoff. Of those, 94 (StringTie) and 616 (QUAPRA) showed high similarity to proteins of closely related cestode species or C. elegans when compared against the curated UniProtKB/Swiss-Prot cestode proteome database [9]. Transcripts that fall below the CPAT cutoff, contain no protein family domain, and show low similarity to any known protein will be considered potential non-coding genes. These will also be submitted to miRTools 2.0 to predict and characterize non-coding genes.
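The expression filter (FPKM > 1) and the split at the coding-score cutoff described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: it assumes StringTie-style GTF transcript records that carry an FPKM attribute, and the function names, the toy records, the scores, and the cutoff value of 0.5 are hypothetical placeholders. In practice the cutoff comes from CPAT training, and how a score exactly equal to the cutoff is treated is also an assumption here.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_transcripts(gtf_text):
    """Yield (transcript_id, fpkm) for each transcript record in a GTF string."""
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "FPKM" in attrs:
            yield attrs["transcript_id"], float(attrs["FPKM"])


def expressed_ids(gtf_text, min_fpkm=1.0):
    """Return IDs of transcripts with FPKM strictly above min_fpkm."""
    return [tid for tid, fpkm in parse_transcripts(gtf_text) if fpkm > min_fpkm]


def split_by_coding_score(scores, cutoff):
    """Partition IDs into (above cutoff, not above cutoff).

    Assumption: a score equal to the cutoff is treated as not above it.
    """
    above = [tid for tid, score in scores.items() if score > cutoff]
    not_above = [tid for tid, score in scores.items() if score <= cutoff]
    return above, not_above


# Toy records (hypothetical), laid out like StringTie GTF output.
EXAMPLE_GTF = "\n".join([
    'chr1\tStringTie\ttranscript\t1\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "5.0"; FPKM "2.5"; TPM "3.1";',
    'chr1\tStringTie\texon\t1\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "5.0";',
    'chr1\tStringTie\ttranscript\t2000\t2500\t1000\t-\t.\t'
    'gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "0.4"; FPKM "0.8"; TPM "1.0";',
    'chr1\tStringTie\ttranscript\t3000\t3900\t1000\t+\t.\t'
    'gene_id "STRG.3"; transcript_id "STRG.3.1"; cov "9.0"; FPKM "7.2"; TPM "8.0";',
])

kept = expressed_ids(EXAMPLE_GTF)

# Hypothetical coding scores for the kept transcripts; 0.5 is a placeholder cutoff.
example_scores = {"STRG.1.1": 0.91, "STRG.3.1": 0.12}
coding, candidates_noncoding = split_by_coding_score(example_scores, cutoff=0.5)
```

With these toy records, only the two transcripts with FPKM above 1 pass the filter, and the placeholder cutoff sends one to the coding group and one to the non-coding candidates, which would still need the domain and similarity checks described above.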
CONCLUSION: Our adapted pipeline for discovering new genetic elements from transcriptomic data showed excellent potential for improving the current T. solium reference genome annotation, with the possibility of adding at least 600 new protein-coding genes. The next steps are to compare the StringTie results with those from QUAPRA, quantify the potential non-coding transcripts, annotate the new findings in a new GTF file, and submit it to the WormBase database for public use.
REFERENCES: 1- T. solium genome - https://parasite.wormbase.org/Taenia_solium_prjna170813/Info/Index
2- T. solium public RNA-Seq (GSM2227058, SRX1899230) - https://www.ncbi.nlm.nih.gov/sra?term=SRX1899230
3- Ji X. et al., doi: 10.1093/nar/gkaa638.
4- Kim D. et al., doi: 10.1038/s41587-019-0201-4.
5- Pertea M. et al., doi: 10.1038/nbt.3122.
6- Ji X. et al., doi: 10.1007/s11427-018-9433-3.
7- Trapnell C. et al., doi: 10.1038/nbt.1621.
8- Wang L. et al., doi: 10.1093/nar/gkt006.
9- UniProt website - https://www.uniprot.org/ (accessed and data downloaded in Sep. 2021)
KEYWORDS: Genome Improvement, Taenia solium, Neurocysticercosis, RNA-Seq