Introduction

Canavan disease (CD) is an autosomal recessive neurodegenerative disorder due to homozygous or compound heterozygous pathogenic variants in ASPA. Though it is pan ethnic, highest prevalence of cases is seen in Ashkenazi Jews. The overall prevalence of CD is 1 in 200,000 to 400,000 [1], whereas the prevalence in the Ashkenazi population is around 1 in 500 to 1 in 6700 [2, 3]. CD presents in early infancy by 3 to 6 months with evident macrocephaly and global developmental delay. As the disease advances, spasticity, seizures and optic atrophy sets in with early mortality. In contrast, the less common M/JCD presents with near normal period of infancy or only mild delay in motor milestones. Most children with M/JCD achieve walking, attend school and are normocephalic. M/JCD has a highly variable and protracted course with gradual development of neurological manifestations. It is established that, the severity of the presentation depends on the degree of residual activity of the aspartoacylase enzyme. Mildest cases with the highest residual activity, a clear genotype–phenotype correlation is currently possible [1].

We report here a comprehensive genetic investigation into the pathogenetics of two families (Fig. 1) hailing from a community of southern India with probably the highest clustering of M/JCD recorded historically, with a novel pathogenic variant. The probands belong to a community referred to as Telugu Devanga Chettiars. As per the historical records, this weaver community migrated during the expansion of Vijayanagara empire between 15th and 16th centuries from Andhra Pradesh and settled in Tiruchirappalli [4]. A second wave of migration began at about 150 years ago from Tiruchirappalli to a small geographical region near Tiruvannamalai in Tamil Nadu. Currently TDC has a population size of just over 3000, inhabiting three closely situated villages. The language spoken by TDC is Telugu, while the dominant language spoken in the district is Tamil. They are orthodox Hindus and customarily marry only within the community. Overall, their culture, linguistics, mating pattern and the geographical isolation has resemblance to an island population. Historically at least 24 such cases of M/JCD have been observed by the community within the last 3–4 generations with an average life span of 20–25 years. Currently around 10 such cases are alive, 3 of whom have been evaluated by us. The mating pattern is less than sufficient to explain the high rates of an extremely rare form of neurodegenerative disorder with reduced fitness. Hence, our objectives were; (1) to explore the mechanism underlying disease causation by the novel pathogenic variant and (2) explore the possibility of a founder event and subsequent endogamy leading to this high prevalence.

Fig. 1: Pedigree of the family representing the generations, MRCA, and segregation of ASPA pathogenic variant.
figure 1

Individual 2 and 6 from generation IV are the common link between the two families with the MRCA in generation I. Note that individual 6 from generation IV is not a carrier. Individual 2 from generation IV was not sequenced. Affected individuals with M/JCD from two families are represented with arrows.

Materials and methods

Clinical presentation

Proband 1, currently 8-year-old male was referred to us at the age of 3 years for the evaluation of developmental delay. His best achieved developmental milestones at presentation were walking with support, immature pincer grasp and speaking bi-syllables. Antenatal and perinatal history was uneventful with a normal birth weight of 3.24 kg and head circumference of 36 cm. Neurological examination revealed normocephaly, horizontal nystagmus, hypotonia with diminished deep tendon reflexes and gait ataxia. Fundus examination showed no evidence of optic atrophy. Slow gain in development was observed during follow-up.

MR Brain imaging done at 18 and 27 months showed characteristic features of M/JCD; confirmed by an elevated N-acetyl aspartate peak in MRS (Supplementary Fig. 1). There was no interval progression between the scans. Urine gas chromatography/mass spectrometry analysis revealed a 15-fold elevation of N-acetyl aspartic acid levels.

Proband 2 is a 7-year-old male who was referred at the age of 1 year for evaluation of developmental delay. Antenatal and perinatal history was uneventful with a normal birth weight, and head circumference of 3 kg and 34.5 cm respectively. During his first visit, mild lag in all developmental domains were present (developmental age ~10–11 months). On examination, growth parameters were within normal range including head circumference. Nervous system examination showed no deficits. During follow-up at 2 years of age, ataxia of gait and stance was observed. Deficits in hand coordination was observed at 3.5 years of age. MRI Brain showed similar findings as proband 1.

Mutation analysis

Sanger sequencing of the coding exons of the ASPA was performed in the probands and family members using BigDye Terminator v3.1 kit and ABI 3500 Genetic Analyzer following standard manufacturer's protocol (Supplementary Fig 2). Various in silico analyses tools were used and the variant was categorized according to the American college of medical genetics (ACMG) guidelines [5]. Whole exome sequencing (WES) was performed for both the probands using Twist core exome kit (Twist Biosciences, USA) and sequenced on Illumina NextSeq 550 sequencer. The data was analyzed using Sentieon and VarSeq (Golden Helix, Inc, USA).

cDNA studies

Total RNA was isolated from the patient-derived fibroblast cell line [6]. cDNA was synthesized using RevertAid first strand cDNA synthesis kit (Thermo Fisher Scientific, Inc., USA). Sanger sequencing was performed using primers for exons 2–5 (Supplementary Table S6).

Genome-wide genotyping array

A total of 14 individuals from 2 families of TDC (Fig. 1) and 20 healthy control individuals were genotyped. Controls included were (i) TML, Tamil speaking population sharing the same geographical locality as TDC (ii) north Indians (NOI) natively speaking Indo-European languages, residing in the northern part of India and (iii) local population with second or third-degree consanguineous parentage (CSG). Genotyping was performed using Infinium HumanCytoSNP-12 v2.1 BeadChip as per the manufacturer’s protocol (Illumina, Inc., USA). Genotype data were extracted and were exported as PLINK.ped/.map files [7, 8]. Following QC steps, a total of 299,140 markers were considered. Of which, only autosomal markers were considered for further downstream analysis. After removing markers from sex chromosomes, a total of 275,544 remained. Furthermore, we extracted data from the 1000 Genomes Project [9] for Indian Telugu in the UK (ITU, n = 20), Gujarati Indians in Houston (GIH, n = 20), and Yoruba in Ibadan (YRI, n = 20) populations for the corresponding markers which were genotyped in-house. SNPs with missing genotypes in more than 10% individuals were excluded. This resulted in final set of 261,599 markers.

Estimation of fixation index and runs of homozygosity

Fixation index (F) with LD pruned (with r2 threshold of 0.3) data and runs of homozygosity (ROH) were calculated by PLINK1.9 [8]. LD pruning was not performed for ROH estimation, as our aim was to investigate not only parental relatedness but also to determine any occurrence of population bottleneck and isolation [10]. The threshold size of more than 500 kb was considered, as this largely eliminates the effect of linkage disequilibrium (LD) and therefore the frequent short ROH due to haplotype structure [10]. ROH so estimated were further subdivided into 3 length categories i.e., 0.5–1.99 Mb (class I ROH), 2–4.99 Mb (class II ROH) and >5 Mb (class III ROH). This subdivision is based on the knowledge of the role of diverse population genetic processes determining the size and frequency of ROH [11, 12]. The individual autozygosity measure based on ROH (FROH) was computed for three different ROH length thresholds as described previously [10]. Sum total length of ROH (SROH) against the total number of ROH (NROH) on x and y-axes respectively were plotted. Median values for all population groups were computed as described previously [11]. ROH spanning the entire ASPA was also explored among the population groups. ROH calculated from genotyping array was also compared for accuracy in size and marker details from the WES data.

Estimation of identity-by-descent (IBD) and kinship coefficients

Kinship coefficients and proportion of the genome that is IBD (propIBD) were computed using KING v2.2.7 [13]. We then computed pair-wise kinship coefficients among the members within families and between families. To robustly detect IBD segments, Refined IBD (v17Jan20.102) was used after phasing genotypes by Beagle 5.4 for seven unrelated individuals in the pedigree [14, 15].

Phasing and haplotype reconstruction in pedigrees

Haplotype reconstruction of chromosome 17p region (2980 markers) was performed using MERLIN v1.12 [16]. The haplotypes including recombination events were visualized using HaploPainter v1.043 [17]. The ROH regions around the ASPA pathogenic variant from both the probands were taken as starting point for haplotype analysis. The chromosome carrying the pathogenic variant from each of the carrier family members were aligned together and examined for similarity in markers among all the carriers and the affected. The haplotype regions so inferred were further re-examined for accuracy using PHASE v2.1.1 [18]. The LD (r2 and Dʹ) estimates for the markers in the so delineated haplotype region among the population groups were compared.

Population structure and admixture analysis

To investigate the current population structure and any admixture between local populations, we performed three separate statistical paradigms using data from 88 unrelated individuals with kinship coefficients < 0.044 as per KING. Principal component analysis (PCA) using SmartPCA from EIGENSOFT v7.2.1 [19, 20] with the results visualized in R program. To eliminate the distorting effect of LD on eigenvector structure and eigenvalues, ‘ldregress’ was applied. Further, in view of significant endogamy in Indian population leading to inflated FST estimates, we chose ‘inbreed: YES’ parameter [21]. The FST estimates, Tracy-Widom statistics and ANOVA p values were tabulated for the inferred populations. A Bayesian model-based admixture mapping was performed using STRUCTURE v2.3.4 [22, 23] on the LD pruned data with correlated allele frequencies and admixture model. The output was visualized using CLUMPAK [24]. Maximum-likelihood-based inference of population tree using TreeMix v1.13 [25] on the LD pruned data with YRI as the root population was performed. The output was visualized using R ‘plotting_funcs.R’ script.

Mutation age estimation

In view of the small sample size and the variant being novel, the Gamma method was used as described previously [26].

Results

Both the probands harbored a novel homozygous variant c.526G>A (NC_000017.10:g.3386886G>A; NM_000049.4(ASPA):c.526G>A) in exon 3. The missense variant p.Gly176Ser has not been reported previously in ClinVar/HGMD or in the population databases such as gnomAD, 1000 Genomes, GenomeAsia100K, and our in-house database. The variant is predicted to be damaging by both SIFT and PolyPhen2. The nucleotide c.526 is predicted to be highly conserved by GERP++ and PhyloP across 100 vertebrates. Analysis using MaxEntScan and NNSplice revealed a splicing defect. varSEAK splice tool predicted the replacement of the consensus splice donor site by the activation of a cryptic splice donor site 4 bp upstream (new position; c.523G). The variant segregated in a Mendelian recessive fashion among the family members. Hence, this variant was classified as pathogenic with ClinVar accession number VCV001700565.1.

The cDNA sequencing revealed a frameshift pattern starting 1 bp upstream of the novel variant (Supplementary Fig. 2). Deconvoluting the cDNA sequence revealed, presence of 2 possible transcripts. One transcript with the activation of an upstream cryptic splice site GT (at position c.523_524) along with splicing of the 4th exon as predicted. This would result in a premature termination at codon 176 (p.Gly176Ter) and theoretically many more premature termination codons further downstream. Such an mRNA is expected to undergo nonsense-mediated decay (NMD), as the variant is 223 bp upstream of the last exon-junction complex (EJC). Second transcript with the retainment of canonical splice site (escaping the splicing aberration) resulting in a complete transcript. Probably this results in a normal length protein harboring the missense change p.Gly176Ser. This non-classical splicing defect due to an exonic mutation has been classified as type IV splicing defect [27]. Such examples have been reported previously [28,29,30]. This is in line with the reported literature that, a milder missense variant and another pathogenic LOF variant often leads to M/JCD [1]. Further, studies in the form of NMD analysis/enzyme activity levels are required to confirm the above possibilities.

Fixation index

The history of migration of the TDC from their historical origin centuries ago probably with a bottleneck and a history of endogamy leads one to speculate the possibility of low genetic diversity. Fixation index (F) is the most basic measure of such a variation and the results were evidently concrete. On an average, there was 0.5–0.7% reduction in heterozygosity in the TDC as compared to the expected estimates under random mating. We also noticed a higher median F value compared to individuals (with no history of consanguinity) inhabiting the same geographical location (Table 1). The deviation, however, is an order of magnitude less compared to the individuals born of consanguineous parentage. This was an early indication to the possible role of endogamy but with a lesser degree than consanguinity in the TDC. Reduction in heterozygosity also implies an increased proportion of loci that are homozygous: the Wahlund effect [31].

Table 1 Table depicts various subpopulations studied with n samples.

ROH frequency and size among subpopulations

In Fig. 2A, the class I ROH represents similarity of the population living in the same geographical location as opposed to a population from a distant continent. For the class I ROH, no single subpopulation of Indian origin could be statistically differentiated based on SROH and NROH. However, each of the subpopulation showed higher median SROH and NROH compared to YRI which were statistically significant (Supplementary Table S4). This is comparable to previously reported positive correlation of mean SROH with the increasing geographical distance of the study population from East Africa [12].

Fig. 2: Total number of ROH (NROH) and sum total length of ROH (SROH) distribution for different length thresholds, represented as box and whisker plots.
figure 2

A, B and C represents ROH sizes 0.5–1.99 Mb, 2–4.99 Mb and >5 Mb respectively. D Represents plot of median NROH versus median SROH for distinct populations (ROH size threshold of >2 Mb was considered for calculation).

The size criterion for class II ROH was based on well-documented literature evidence, where subpopulations begin to show differences in both mean SROH and NROH [10, 12]. It is clear from the data (Fig. 2B) that TDC had significantly higher SROH and NROH compared to other subpopulations with no history of parental relatedness (Supplementary Table S4). Even though a small difference in the median values was found between TDC and individuals with parental relatedness (consanguinity), the difference was not statistically significant (p = 0.117). These findings are concordant with the literature evidence of distribution of intermediate-sized ROH [10, 11]. The intermediate-sized ROH are often due to the effect of increased background relatedness determined by the effective population size [10, 11]. Historical evidence of a probable population bottleneck in TDC and cultural isolation with consequent endogamy as well as the current population size, effectively explains increased background relatedness. Finally, for the class III ROH (Fig. 2C), TDC showed higher SROH and NROH (p < 0.005) compared to subpopulations with no history of parental relatedness. However, CSG differed significantly from TDC (p < 0.0016), which is in line with the increased occurrence of multiple long ROH in individuals with recent parental relatedness. It is apparent from the data that, despite the absence of consanguinity, there is substantially higher than expected levels of intermediate and long-sized ROH in the TDC. As can be noticed in Fig. 2D the TDC is on the left of the trendline suggesting that NROH is proportionately more than SROH. This is due to the fact that the class III ROH are few in TDC but the population is enriched with class II ROH. In contrast, the CSG group had the right shift because of numerous long class III ROH without a proportional increase in NROH. The pattern observed in TDC is analogous to worldwide patterns of numerous native populations with history of bottleneck, genetic isolation with a small effective population size, and consequent endogamy [11].

FROH is a better measure of autozygosity even in the absence of accurate and deep pedigrees, as FROH extracts true IBD as opposed F values from pedigrees (FPED) where the founders are often poorly defined. FROH also reliably captures autozygosity originating from ancient as well as recent parental relatedness [11]. TDC had higher FROH values compared to all other subpopulations (p < 0.05) except CSG (Table 1). The overall results suggest an increased amount of autozygosity in the TDC, in spite of the absence of recent parental relatedness.

IBD and kinship coefficients

Results of the kinship coefficient φ as per KING between pairs of individuals showed accurate relatedness among individuals with self-reported pedigree relationship. The results also unexpectedly unearthed cryptic relatedness with higher φ than expected among TDC individuals who were unrelated as per the pedigree (Supplementary Table S1). The propIBD-based relationship inference provides more robust estimation of relatedness and the results of which were concordant with the φ results. Refined IBD, a robust IBD segment caller showed that the summated fraction of genome that is identical by descent (LOD > 3.0) in the TDC founders were several folds higher than the unrelated individual pairs corroborating with the findings of propIBD estimates. This can be explained by the equation \(f = \left( {\frac{1}{2}} \right)^i\left( {1 + Fa} \right)\) [32]; where Fa is the cryptic probability of a founder in a pedigree who himself/herself is a product of inbreeding. Fa values increase in a population with endogamy alone or bottleneck superimposed with endogamy. Given the history of TDC, an inflated kinship coefficient/IBD estimates is not startling.

Haplotype analysis

As evident in Fig. 3, the family 2 harbored an unbroken segment of size 7.96 Mb which was segregating in a Mendelian fashion. The size exactly matches with the ROH found in the proband. However, in family 1 the haplotype has undergone reduction in size probably because of at least two independent ancestral recombination events, one each on the paternal and maternal side. The sizes of the segregating haplotypes were 4.73 Mb and 1.02 Mb respectively. The core haplotype of 1.02 Mb consisting of 147 SNPs is common to all the carriers and the affected. The Dʹ and r2 values between SNPs in the region of the haplotype were substantially high compared to other subpopulations. The sequence of the haplotype was also confirmed using PHASE 2.2 program. Further, similar sized haplotype was not found among any other individuals from various populations analyzed.

Fig. 3: Phased haplotypes on chromosome 17p representing shared haplotypes spanning ASPA locus between individuals from the two families.
figure 3

Segregating haplotypes of three different sizes (HA, HB, HC) are represented in red blocks. The figure also represents meiotic recombination events observed.

Population structure

PCA clearly grouped individuals into distinct clusters which correlated well with their geographical location and native languages (Table 2). The data were in good agreement with eigenvalues, Tracy–Widom statistic, and ANOVA P values (Supplementary Table S5). Clear cut population differentiation between TDC and TML who are inhabiting the same geographical region was noted (Fig. 4A). Both the populations were far removed from the other subpopulations such as NOI, GIH, and ITU along the PC1 and PC2 axes. Further, all these populations aligned on the PC2 axis and were at a notable distance from YRI on the PC1 axis (Supplementary Fig. 4A) as documented in previous studies [21]. This was the first clear indication as to the differentiation of TDC from other populations and the presence of a population structure. FST output after correcting for inbreeding, as per the Balding-Nichols model (Table 3) showed moderately great differentiation of TDC from YRI and moderate from other subpopulations of the Indian ancestry without any overlap [33]. The only overlap was between ITU with GIH (FST = 0.003; threshold for overlap considered < 0.004) [34].

Table 2 Statistical significance of population differentiation in the PCA analysis.
Fig. 4: Population clustering and pattern of divergence among the populations studied.
figure 4

A Principal Component Analysis of SNP data from various subpopulations. Eigenvector 1 (PC1) and eigenvector 2 (PC2) along with the variances were plotted for each individuals in the subpopulation. B Maximum-likelihood population tree using TreeMix (YRI was taken as root population). The tree shows the split of TDC and TML from a common ancestral population which in itself is a split from the common ancestors of ITU. The length of the branch representing TDC clearly shows a significant divergence from other subpopulations.

Table 3 Wright’s F-statistics (FST) estimated using the Balding-Nichols model [20, 39] indicating the pair-wise differentiation among the subpopulations.

TreeMix provided the conclusive maximum-likelihood tree representing the topological relationship of subpopulations. The Newick format ML tree (Fig. 4B) shows, TDC and TML splitting from a common ancestral population which in itself a split from the common ancestors of ITU. The length of the branch representing TDC clearly shows a significant divergence (p = 7 × 10−5) and probably minimal or no migration between TDC and TML. STRUCTURE results concluded the same (Supplementary Fig. 4B).

There is enough evidence as per the results of FST, PCA, STRUCTURE and Treemix to support a process akin to ‘lineage sorting’ having taken place. This is probably due to the significant population contraction (leading to drift) and long genetic isolation. As per the Island model (expected \(F_{{{{{\rm{ST}}}}}} = 1 - e^{ - \frac{1}{{2Ne}}}\)), a significant population contraction resulting in drift will increase differentiation and there by FST [35]. One of the main assumptions of the Island model is that no migration has taken place between the subpopulations. Based on the self-report of individuals from TDC, knowledge of their strict cultural practices, it is evident along with the above-mentioned findings that the TDC has undergone genetic isolation after a founding event with hardly any migration thereafter.

Mutation dating

Assuming a correlated genealogy, the mutation arose 15.1 generations ago, with a confidence interval of 3.3–26.5 corresponding to around 375 years (assuming one generation = 25 years). This coincides well with the age of historical migration of the first founders of TDC to the current location of inhabitation [4]. The Gamma method as opposed to coalescent-based methods, in fact provides an estimation of the age of the most recent common ancestor (MRCA) rather than the exact age of the mutational process.

Discussion

M/JCD often results from retained residual enzymatic activity of aspartoacylase. Various studies in M/JCD have observed that, a milder missense pathogenic variant such as p.Tyr288cys or p.Pro257Arg in compound heterozygous trans state with another pathogenic variant usually a loss of function (LOF) variant or a missense variant in exons coding for the major functional domains  [1, 36, 37]. The novel pathogenic variant c.526G>A found in both the probands, has been demonstrated to cause both the splicing defect (LOF) and probably a full-length protein with a missense change. This escape of a portion of transcripts from probable NMD is in line with the previous findings of pathogenetics of M/JCD. These evidences explain the milder phenotype in the probands with significantly longer life span compared to infantile CD.

The pedigree suggests no consanguinity what so ever in both the families. The two probands are related through the MRCA with not less than 9th degree relatedness. The known relatedness is through the paternal side of proband 1 and maternal side of proband 2 (Fig. 1). Even though there is MRCA between the two families, the deleterious allele is not passed through the individual connecting MRCA with the probands (IV-6 is not a carrier). Further, there is evidence for a high prevalence of the disorder in the population and currently at least 10 alive cases. The approximate prevalence provides an indirect estimate of a very high carrier frequency of about 1 in 10, calculated as per HWE. The M/JCD is a severe disorder with zero fitness. Ideally such deleterious recessive alleles are purged away in an inbred population; as homozygosity for such alleles eventually will be selected away. Hence, the novel variant is unlikely to be a new mutation in the recent past. Only a random genetic drift event or selection can change allele frequencies and not inbreeding alone, which can only change genotype frequencies [32].

The robust signatures of the founder event in TDC are supported by the following observations. (a) The high prevalence of the deleterious allele with reduced fitness. (b) Significant population differentiation as evidenced by FST, PCA, TreeMix results, without the probable effects of selection and migration. (c) A moderately high fixation index value with correspondingly high autozygosity (FROH) in the absence of consanguinity in TDC and the robust population structure infers the Wahlund effect [31]. (d) The left shift pattern of NROH vs. SROH of TDC for the class II and III ROH is indicative of a population bottleneck followed by endogamy [10]. (e) Elevated kinship coefficients, propIBD estimates and proportion of autosomal genome under IBD in TDC in the absence of consanguinity suggests an increased background relatedness. Higher proportions of IBD with longer IBD segments are often caused by founder events [38]. (f) A large haplotype spanning across ASPA in the absence of consanguinity is another indicator. ASPA locus and the flanking region on chromosome 17p is not a known ROH hotspot/ cold spot [12]. Genomic region encompassing ASPA is usually kept out of ROH by recombination events acted upon by selection pressures to avoid occurrence of homozygous deleterious alleles. Hence, the observation of smaller sized haplotypes in 17p region among rest of the subpopulations. However, ROH and the large haplotype in this region appears to be specific to TDC. (g) Mutation dating suggesting that the MRCA, for the entire community (the founder) who can be traced back in age, coincides with the documented historical migration [4]. Here we have provided a consilience of multiple genetic evidences to support the founder event in TDC. Numerous founder events related to recessive disorders, similar to the one described here have been well studied in the Indian population [21, 34].

The study has not only thrown light on pathogenetics, natural history of M/JCD in TDC but also the need for a community carrier screening program to prevent further occurrences of this disabling disorder. The science unearthed in this effort can also be a harbinger for future therapeutic studies considering the protracted course of M/JCD allowing itself to be amenable for early interventions.