WHOLE GENOME RESEQUENCING
Structural variants in Chinese population and their impact on phenotypes, diseases and population adaptation
Nanopore | PacBio | Whole genome re-sequencing | Structural variation calling
In this study, Nanopore PromethION sequencing was provided by Biomarker Technologies.
Highlights
In this study, long-read sequencing on the Nanopore PromethION platform revealed an overall landscape of structural variations (SVs) in the human genome, deepening the understanding of the role of SVs in phenotypes, diseases and evolution.
Experimental Design
Samples: Peripheral blood leukocytes from 405 unrelated Chinese individuals (206 males and 199 females) with 68 phenotypic and clinical measurements. By ancestral region, 124 individuals came from northern provinces, 198 from the South and 53 from the Southwest; the ancestral region of the remaining 30 was unknown.
Sequencing strategy: Whole genome long-read sequencing (LRS) with Nanopore 1D reads and PacBio HiFi reads.
Sequencing platform: Nanopore PromethION; PacBio Sequel II
Structural Variation Calling
Figure 1. Workflow of SV calling and filtering
Main Achievements
Structural variation discovery and validation
Nanopore dataset: A total of 20.7 Tb of clean reads was generated on the PromethION sequencing platform, an average of 51 Gb of data per sample, or approx. 17-fold depth.
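The yield and depth figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the genome size of 3.1 Gb is an assumption (an approximate haploid human genome size), not a value stated in the study, and the function names are illustrative.

```python
# Back-of-the-envelope check of the sequencing yield quoted in the text.
TOTAL_YIELD_TB = 20.7   # total clean data, from the text
N_SAMPLES = 405         # number of individuals, from the text
GENOME_SIZE_GB = 3.1    # ASSUMPTION: approximate haploid human genome size


def per_sample_yield_gb(total_tb: float, n_samples: int) -> float:
    """Average yield per sample in Gb, taking 1 Tb = 1000 Gb."""
    return total_tb * 1000 / n_samples


def mean_depth(yield_gb: float, genome_gb: float) -> float:
    """Mean sequencing depth = bases sequenced / genome size."""
    return yield_gb / genome_gb


yield_gb = per_sample_yield_gb(TOTAL_YIELD_TB, N_SAMPLES)
depth = mean_depth(yield_gb, GENOME_SIZE_GB)
print(f"{yield_gb:.1f} Gb per sample, about {depth:.1f}-fold depth")
```

With these inputs the per-sample yield comes out at roughly 51 Gb and the depth between 16- and 17-fold, consistent with the rounded values reported.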
Reference genome alignment (GRCh38): An average mapping rate of 94.1% was achieved. The mean error rate (12.6%) was similar to that of a prior benchmarking study (12.6%). (Figure 2b and 2c)
Structural variation (SV) calling: Three SV callers were applied in this study: Sniffles, NanoVar and NanoSV. High-confidence SVs were defined as SVs that were identified by at least two callers and passed filters on depth, length and region.
An average of 18,489 high-confidence SVs (ranging from 15,439 to 22,505) were identified per sample. (Figure 2d, 2e and 2f)
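The "at least two callers" rule described above can be illustrated with a minimal sketch. This is not the study's actual pipeline: the `SVCall` record, the matching tolerances (`max_dist`, `min_len_ratio`) and the minimum length of 50 bp are all hypothetical choices made for illustration, and the depth and region filters mentioned in the text are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    pos: int
    svtype: str   # "DEL", "INS", "DUP" or "INV"
    length: int   # SV length in bp (positive)
    caller: str   # name of the tool that produced the call


def same_event(a: SVCall, b: SVCall,
               max_dist: int = 500, min_len_ratio: float = 0.5) -> bool:
    """Treat two calls as the same SV if type, position and size agree.

    ASSUMPTION: the tolerances are illustrative, not taken from the study.
    """
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    shorter, longer = sorted((a.length, b.length))
    return longer == 0 or shorter / longer >= min_len_ratio


def high_confidence(calls, min_callers: int = 2, min_len: int = 50):
    """Greedily cluster calls, then keep clusters backed by enough callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos)):
        # Compare against the first (representative) call of the last cluster.
        if clusters and same_event(clusters[-1][0], call):
            clusters[-1].append(call)
        else:
            clusters.append([call])
    return [
        cluster[0]
        for cluster in clusters
        if len({c.caller for c in cluster}) >= min_callers
        and cluster[0].length >= min_len
    ]


# Toy input: made-up calls, used only to exercise the function.
calls = [
    SVCall("chr1", 10_000, "DEL", 300, "Sniffles"),
    SVCall("chr1", 10_050, "DEL", 320, "NanoSV"),
    SVCall("chr1", 50_000, "INS", 120, "NanoVar"),   # single caller only
    SVCall("chr2", 7_000, "DEL", 30, "Sniffles"),    # two callers, too short
    SVCall("chr2", 7_010, "DEL", 32, "NanoVar"),
]
confident = high_confidence(calls)
print(confident)
```

On the toy input only the chr1 deletion survives: the insertion is supported by a single caller, and the chr2 deletion falls below the assumed length cutoff.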
Figure 2. Overall landscape of SVs identified by Nanopore dataset
Validation by PacBio: SVs identified in one sample (HG002, child) were validated against a PacBio HiFi dataset. The overall false discovery rate (FDR) was 3.2%, indicating that SV identification from Nanopore reads is relatively reliable.
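For readers unfamiliar with the metric, the FDR here is the share of Nanopore calls that could not be confirmed in the validation dataset. The sketch below shows the calculation; the counts of 1,000 calls and 968 confirmed calls are hypothetical, chosen only so that the result matches the reported 3.2%, and are not figures from the study.

```python
def false_discovery_rate(n_calls: int, n_validated: int) -> float:
    """FDR = unconfirmed calls / all calls."""
    if n_calls <= 0:
        raise ValueError("n_calls must be positive")
    if not 0 <= n_validated <= n_calls:
        raise ValueError("n_validated must lie between 0 and n_calls")
    return (n_calls - n_validated) / n_calls


# HYPOTHETICAL counts, for illustration only.
fdr = false_discovery_rate(n_calls=1000, n_validated=968)
print(f"FDR = {fdr:.1%}")
```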
Non-redundant SVs and genomic features
Non-redundant SVs: A set of 132,312 non-redundant SVs was obtained by merging the SVs from all samples, comprising 67,405 deletions (DELs), 60,182 insertions (INSs), 3,956 duplications (DUPs) and 769 inversions (INVs). (Figure 3a)
Comparison with existing SV datasets: This dataset was compared with published third-generation sequencing (TGS) and next-generation sequencing (NGS) datasets. Of the four datasets compared, LRS15, the only one generated on a long-read sequencing platform (PacBio), shared the largest overlap with this dataset. Moreover, 53.3% (70,471) of the SVs in this dataset were reported for the first time. Broken down by SV type, far more INSs were recovered by the long-read dataset than by the short-read ones, indicating that long-read sequencing is particularly efficient at detecting INSs. (Figure 3b and 3c)
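The counts quoted above are internally consistent, which can be confirmed with a short tally. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Per-type counts of non-redundant SVs, as quoted in the text.
sv_counts = {"DEL": 67_405, "INS": 60_182, "DUP": 3_956, "INV": 769}
novel = 70_471  # SVs reported for the first time, as quoted in the text

total = sum(sv_counts.values())
for svtype, n in sv_counts.items():
    # Share of each SV type in the non-redundant set.
    print(f"{svtype}: {n:>6,} ({n / total:.1%})")
print(f"total: {total:,}")
print(f"novel: {novel:,} ({novel / total:.1%})")
```

The four types sum to the stated total of 132,312, and the novel SVs make up about 53.3% of that set.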
Figure 3. Properties of non-redundant SVs for each SV type
Genomic features: The number of SVs was found to be significantly correlated with chromosome length. The distributions of genes, repeats, DELs (green), INSs (blue), DUPs (yellow) and INVs (orange) were displayed on a Circos diagram, where a general increase in SVs was observed toward the ends of chromosome arms. (Figure 3d and 3e)
Length of SVs: INSs and DELs were significantly shorter than DUPs and INVs, in agreement with the SVs identified in the PacBio HiFi dataset. The lengths of all identified SVs added up to 395.6 Mb, or 13.2% of the entire human genome. On average, SVs affected 23.0 Mb (approx. 0.8%) of the genome per individual. (Figure 3f and 3g)
Functional, phenotypical and clinical impacts of SVs
Predicted loss-of-function (pLoF) SVs: pLoF SVs were defined as SVs overlapping coding sequences (CDS) in which coding nucleotides were deleted or open reading frames (ORFs) were altered. In total, 1,929 pLoF SVs affecting the CDS of 1,681 genes were annotated. Among these genes, 38 were highlighted under the term “immunoglobulin receptor binding” in GO enrichment analysis. The pLoF SVs were further annotated with GWAS, OMIM and COSMIC, respectively. (Figure 4a and 4b)
Phenotypically and clinically relevant SVs: A number of SVs in the Nanopore dataset were shown to be phenotypically and clinically relevant. A rare heterozygous DEL of 19.3 kb, known to cause alpha-thalassemia, was identified in three individuals; it disrupts the genes Hemoglobin Subunit Alpha 1 and 2 (HBA1 and HBA2). Another DEL of 27.4 kb, in the gene encoding Hemoglobin Subunit Beta (HBB), was identified in one other individual. This SV is known to cause serious hemoglobinopathies. (Figure 4c)
Figure 4. pLoF SVs associated with phenotypes and diseases
A common DEL of 2.4 kb, covering the complete third exon of the Growth Hormone Receptor (GHR) gene, was observed in 35 homozygous and 67 heterozygous carriers. The homozygous carriers were significantly shorter than the heterozygous ones (p=0.033). (Figure 4d)
Furthermore, these SVs were used for population evolutionary analysis between two regional groups: North and South China. Significantly differentiated SVs were found on chromosomes 1, 2, 3, 6, 10, 12, 14 and 19; the top-ranked ones were associated with immunity-related regions such as IGH and MHC. It is reasonable to speculate that the differentiation in these SVs may be due to genetic drift and the long-term exposure of sub-populations in China to diverse environments.
Reference
Wu, Zhikun, et al. “Structural variants in Chinese population and their impact on phenotypes, diseases and population adaptation.” bioRxiv (2021).
News and Highlights aims at sharing the latest successful cases with Biomarker Technologies, capturing novel scientific achievements as well as prominent techniques applied during the study.
Post time: Jan-06-2022