We obtained 381 whole-mtDNA sequences from the 1KGP [].

Many recent genetic studies explored different layers of South Asian genetic diversity and population structure [], but they have tended to focus on one or other marker system and, as a result, decisive results on the details of the settlement process are still lacking.

In the last few years, genome-wide (GW) studies have been employed [].

In order to discern migrations into the Subcontinent at different time periods, we also performed a complementary analysis of several “non-autochthonous” N lineages present in South Asia (H2b, H7b, H13, H15a, H29, HV, I1, J1b, J1d, K1a, K2a, N1a, R0a, R1a, R2, T1a, T2, U1, U7, V2a, W and ].

We assessed ML estimations using PAML 4 and the same mitogenome clock assuming the REV mutation model with gamma-distributed rates (discrete distribution of 32 categories) and two partitions, in order to distinguish hypervariable segments I and II (HVS–I and HVS–II) from the rest of the molecule.

India is a patchwork of tribal and non-tribal populations that speak many different languages from various language families.

Indo-European, spoken across northern and central India, and also in Pakistan and Bangladesh, has been frequently connected to the so-called “Indo-Aryan invasions” from Central Asia ~3.5 ka and the establishment of the caste system, but the extent of immigration at this time remains extremely controversial.

We performed runs both with and without the assumption of a molecular clock, in order to carry out likelihood ratio tests (LRT) []. We additionally estimated node ages in different sub-regions of the Subcontinent (west, south, central and east) using two different approaches: (1) considering all samples from a given region, regardless of the putative geographical origin of the clade, and (2) considering the most probable origin of each major haplogroup (based on branching structure, number of main branches, and centre of gravity) and including only basal lineages of each region [ 0.25, with a window size of 100 SNPs and step size of 1), yielding a subset containing 164,149 SNPs.
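The likelihood ratio test of the molecular clock mentioned above can be illustrated with a minimal sketch. It assumes the usual formulation for a strictly bifurcating tree of n sequences, in which twice the log-likelihood difference between the unconstrained and clock-constrained runs is compared with a chi-square distribution with n - 2 degrees of freedom. The function names and the log-likelihood values are hypothetical and are not taken from the study.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square distribution with integer df.

    Uses the recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with x = stat / 2.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)             # Q(1, x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))  # Q(1/2, x)
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def clock_lrt(lnl_clock, lnl_no_clock, n_sequences):
    """Return (statistic, df, p-value) for a molecular clock LRT.

    Assumption: the tree is strictly bifurcating, so the clock model has
    n - 2 fewer free parameters than the unconstrained model.
    """
    stat = max(0.0, 2.0 * (lnl_no_clock - lnl_clock))
    df = n_sequences - 2
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a 4-sequence tree (illustration only).
stat, df, p_value = clock_lrt(-1002.5, -1000.0, 4)
```

A small p-value would indicate rejection of the clock for the tree being tested; the threshold used is a matter of the analyst's choice.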

We subjected these to principal component analysis (PCA) using the standard PCA tool provided in EIGENSOFT v6.0.1 [], calculating the first 10 principal components (PCs) and the fraction of variance explained by each.
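As an illustration of the fraction-of-variance calculation, the sketch below divides each of the leading eigenvalues by the sum of all eigenvalues. This is the common convention and is assumed here, not confirmed by the text. The function name and the example eigenvalues are hypothetical.

```python
def variance_fractions(eigenvalues, n_pcs=10):
    """Fraction of total variance explained by each of the first n_pcs PCs.

    Assumption: `eigenvalues` holds ALL eigenvalues of the PCA, sorted in
    decreasing order, so that their sum represents the total variance.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues[:n_pcs]]


# Hypothetical eigenvalues (illustration only, not results from the study).
example = variance_fractions([4.0, 3.0, 2.0, 1.0], n_pcs=2)
```

Note that dividing only by the sum of the retained PCs, not by the sum of all eigenvalues, would overstate the fraction explained by each component.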

Whilst current genome-wide analyses conflate all dispersals from Southwest and Central Asia, we were able to tease out from the mitogenome data distinct dispersal episodes dating from between the Last Glacial Maximum and the Bronze Age.

Moreover, we found an extremely marked sex bias by comparing the different genetic systems.

The maternally inherited mitochondrial DNA (mtDNA) allows researchers to identify specific lineage clusters (clades or haplogroups) and to correlate them with geography.