Contact Author: Yuan Wei (yuan.wei@ucf.edu)
The Recomb-Mix program uses the C++ Boost and OpenMP libraries (https://www.boost.org/, https://www.openmp.org/) and is compiled with GCC 9.1.0 using the -Os optimization flag under a 64-bit Unix-based operating system:
g++ -std=c++17 -fopenmp RecombMix.cpp -l boost_iostreams -o RecombMix_v0.7 -Os
The Recomb-Mix program accepts the following parameters:
-p, or panel <INPUT PANEL FILE>, where <INPUT PANEL FILE> is the input reference panel path and file name (required).
-q, or query <INPUT QUERY FILE>, where <INPUT QUERY FILE> is the input admixture panel path and file name (required).
-g, or genetic <INPUT GENETIC MAPPING FILE>, where <INPUT GENETIC MAPPING FILE> is the input genetic mapping path and file name (required).
-a, or ancestry <INPUT POPULATION ANCESTRY FILE>, where <INPUT POPULATION ANCESTRY FILE> is the input population labels of reference panel path and file name (required).
-o, or output <OUTPUT DIRECTORY PATH>, where <OUTPUT DIRECTORY PATH> is the output directory path for all files (optional; default is the current directory).
-i, or inferred <OUTPUT INFERRED FILE NAME>, where <OUTPUT INFERRED FILE NAME> is the output inferred local ancestry file name (optional; default is admix_inferred_ancestral_values_local.txt).
-e, or weight <WEIGHT>, where <WEIGHT> is the weight of cross population penalty in cost function (optional; default is 1.5).
-f, or frequency <ALLELE FREQUENCY>, where <ALLELE FREQUENCY> is the minor allele frequency threshold to exclude the allele values for the markers whose minor allele frequencies are below the threshold (optional; default is 0). By default, it is assumed that the reference panel contains the sequence data. If the reference panel contains SNP-array-like data, it is recommended to use this parameter to filter out minor alleles for the markers based on the given threshold.
-u, or outputcompactpanel <IDENTIFIER>, where <IDENTIFIER> (0 or 1) specifies whether the program outputs a compact reference panel (optional; default is 0: no output).
-s, or estimate <MAXIMUM GAP PHYSICAL DISTANCE>, where <MAXIMUM GAP PHYSICAL DISTANCE> is the maximum gap physical distance (number of markers) for local ancestry estimation (optional; default is 0: no estimation). A gap refers to a query region with at least one but no more than the maximum gap physical distance markers that are not present in the reference panel. An estimate is made only if the inferred ancestral labels of both the left and right adjacent regions are identical, and the shared ancestral label is used to smooth out the gap.
-t, or threads <NUMBER OF THREADS>, where <NUMBER OF THREADS> is the number of CPU cores to use (optional; default is the number of available CPU cores).
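The gap-smoothing rule of the -s (estimate) option described above can be illustrated with a short Python sketch. It only mirrors the rule as stated in the parameter description and is not the program's actual implementation; the function name `smooth_gaps`, the per-marker list of labels, and the use of `None` to mark query markers missing from the reference panel are assumptions made for this example.

```python
from typing import List, Optional


def smooth_gaps(labels: List[Optional[int]], max_gap: int) -> List[Optional[int]]:
    """Fill runs of None (markers absent from the reference panel).

    A run is filled only if it spans between 1 and max_gap markers and the
    labels immediately to its left and right are identical.
    """
    result = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] is not None:
            i += 1
            continue
        # Find the exclusive end of the current run of missing labels.
        j = i
        while j < n and labels[j] is None:
            j += 1
        gap_len = j - i
        left = labels[i - 1] if i > 0 else None
        right = labels[j] if j < n else None
        # Smooth only short gaps flanked by the same ancestry label.
        if gap_len <= max_gap and left is not None and left == right:
            result[i:j] = [left] * gap_len
        i = j
    return result


# The first gap is flanked by label 0 on both sides and is filled;
# the second gap sits between labels 1 and 0 and is left untouched.
print(smooth_gaps([0, None, None, 0, 1, None, 0], max_gap=2))
# [0, 0, 0, 0, 1, None, 0]
```

With `max_gap=0` no gap qualifies, which matches the documented default of no estimation.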
An example command for running the Recomb-Mix program:
./RecombMix_v0.7 -p ./test/reference_panel.vcf -q ./test/admixture_panel.vcf -a ./test/reference_panel_population_labels.txt -g ./maps/recombination_map_GRCh37_chr18.txt
To display the program's help message:
./RecombMix_v0.7 -h
The Recomb-Mix program uses compact reference panels for local ancestry inference. A compact reference panel is space-efficient because it includes only sample templates containing population-level information. The available compact reference panels (located in ./compact_panels/tgp_hgdp) were generated using Recomb-Mix's output compact panel option. The original panels comprised the 1000 Genomes Project (TGP) and the Human Genome Diversity Project (HGDP) (https://www.internationalgenome.org/) and were phased and imputed using Beagle.
The Recomb-Mix program can generate a compact panel from a given reference panel and the population labels of that reference panel. Below is an example command:
./RecombMix_v0.7 -p ./test/reference_panel.vcf -a ./test/reference_panel_population_labels.txt -o ./result/ -u 1
The generated compact reference panel file and its population labels file are saved in the given output folder. They can be reused for future ancestry inference queries. Below is an example command:
./RecombMix_v0.7 -p ./result/compact_reference_panel.vcf -q ./test/admixture_panel.vcf -a ./result/compact_reference_panel_population_labels.txt -g ./maps/recombination_map.txt -o ./result/
A compact reference panel can also be generated while making local ancestry inference calls on given queries against a given reference panel. The two commands above are equivalent to the single command below:
./RecombMix_v0.7 -p ./test/reference_panel.vcf -q ./test/admixture_panel.vcf -a ./test/reference_panel_population_labels.txt -g ./maps/recombination_map.txt -o ./result/ -u 1
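When these runs are scripted, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the short flags documented above; the helper name `build_command` and its keyword arguments are hypothetical and not part of Recomb-Mix.

```python
from typing import List, Optional


def build_command(panel: str, labels: str,
                  query: Optional[str] = None,
                  genetic_map: Optional[str] = None,
                  out_dir: Optional[str] = None,
                  compact: bool = False,
                  binary: str = "./RecombMix_v0.7") -> List[str]:
    """Build a Recomb-Mix argument list using the documented short flags."""
    cmd = [binary, "-p", panel]
    if query is not None:
        cmd += ["-q", query]          # admixture (query) panel
    cmd += ["-a", labels]             # population labels of the reference panel
    if genetic_map is not None:
        cmd += ["-g", genetic_map]    # genetic map file
    if out_dir is not None:
        cmd += ["-o", out_dir]        # output directory
    if compact:
        cmd += ["-u", "1"]            # also write a compact reference panel
    return cmd


# Inference and compact panel generation in a single run.
cmd = build_command("./test/reference_panel.vcf",
                    "./test/reference_panel_population_labels.txt",
                    query="./test/admixture_panel.vcf",
                    genetic_map="./maps/recombination_map.txt",
                    out_dir="./result/",
                    compact=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the program.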
Four input files are required to run the Recomb-Mix program: the reference panel file in VCF or compressed VCF (*.vcf or *.vcf.gz) format, the admixture panel file in VCF or compressed VCF (*.vcf or *.vcf.gz) format, the genetic map file in HapMap text format, and the population labels of the reference panel in text format. More than one individual can be included in the admixture panel. The genetic map file uses the HapMap format, whose description should be found in the first line of the file. Each line from the second line onward contains four tab-delimited fields: Chromosome, Position(bp), Rate(cM/Mb), and Map(cM). Note that the value of the Rate(cM/Mb) field is not used; the rate is instead calculated from the Position(bp) and Map(cM) fields. If the genetic mapping of a physical position in the VCF file is not found, interpolation is used to estimate the genetic distance of that position. The population labels file of the reference panel lists one individual per line in tab-delimited form: Sample id, Population label.
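The interpolation step can be illustrated with a short Python sketch. The text only states that interpolation is used, so the choice of linear interpolation between the two nearest map entries, the clamping of positions outside the map to the nearest endpoint, and the names `load_genetic_map` and `genetic_position` are all assumptions rather than the program's confirmed behavior. The map is assumed to be non-empty and sorted by position.

```python
import bisect
from typing import Iterable, List, Tuple


def load_genetic_map(lines: Iterable[str]) -> Tuple[List[int], List[float]]:
    """Read Position(bp) and Map(cM) from HapMap-format lines.

    The first line is the header; the Rate(cM/Mb) column is ignored.
    """
    positions: List[int] = []
    cms: List[float] = []
    for k, line in enumerate(lines):
        if k == 0 or not line.strip():
            continue  # skip the header line and blank lines
        _chrom, pos, _rate, cm = line.strip().split("\t")
        positions.append(int(pos))
        cms.append(float(cm))
    return positions, cms


def genetic_position(bp: int, positions: List[int], cms: List[float]) -> float:
    """Return the genetic position (cM) of a physical position (bp)."""
    i = bisect.bisect_left(positions, bp)
    if i < len(positions) and positions[i] == bp:
        return cms[i]      # exact entry present in the map
    if i == 0:
        return cms[0]      # before the first entry: clamp (assumption)
    if i == len(positions):
        return cms[-1]     # after the last entry: clamp (assumption)
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    # Linear interpolation between the two flanking map entries.
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)
```

For a VCF site located halfway between two map entries, this returns the midpoint of their Map(cM) values.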
The output file contains the inferred ancestry labels of each individual haplotype in the admixture panel in a tab-delimited text format. Each line represents the result for one individual haplotype, starting with the Admixture individual haplotype id, followed by a list of inferred segments, each having three fields: Physical start position, Physical end position, Inferred ancestry label id. The Inferred ancestry label id is a zero-based index into the population labels in the order of their appearance in the input population labels of the reference panel file, which can be found in the first line of the file.
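A result line can be parsed along the following lines in Python. The sketch assumes that the haplotype id and all segment fields sit on one line separated by tabs, with the three fields of each segment appearing consecutively; this flat layout and the names `Segment` and `parse_inferred_line` are assumptions, so it is worth checking them against an actual output file before relying on the parser.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: int      # Physical start position
    end: int        # Physical end position
    label_id: int   # Inferred ancestry label id (zero-based)


def parse_inferred_line(line: str) -> Tuple[str, List[Segment]]:
    """Split one result line into a haplotype id and its segments."""
    fields = line.strip().split("\t")
    hap_id, values = fields[0], fields[1:]
    if len(values) % 3 != 0:
        raise ValueError("segment fields are not a multiple of three")
    segments = [
        Segment(int(values[k]), int(values[k + 1]), int(values[k + 2]))
        for k in range(0, len(values), 3)
    ]
    return hap_id, segments
```

Raising an error on an unexpected field count makes a layout mismatch visible immediately instead of silently producing shifted segments.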
Below is a tutorial on how to run Recomb-Mix using the example provided in the test folder of this repository. The example infers the local ancestry label of each site for three admixed individuals (./test/admixture_panel.vcf), using 30 reference individuals: 10 Africans, 10 Europeans, and 10 Asians (./test/reference_panel.vcf), with their population labels in ./test/reference_panel_population_labels.txt. The recombination rates used for the inference are loaded from a recombination map (./maps/recombination_map.txt).
- Clone the repository in a local directory
git clone https://github.com/ucfcbb/Recomb-Mix.git
- Run the Recomb-Mix example in the local directory
./RecombMix_v0.7 -p ./test/reference_panel.vcf -q ./test/admixture_panel.vcf -a ./test/reference_panel_population_labels.txt -g ./maps/recombination_map.txt
The inferred ancestry labels of each admixed individual haplotype per site are written to a file in the current directory (./admix_inferred_ancestral_values_local.txt). The output file format is described in the Input and Output Files section.
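To turn the numeric ids in the result file back into population names, the order-of-appearance rule given for the output format can be reproduced from the population labels file. The Python sketch below follows that reading of the rule; the function name `label_names_in_order` is hypothetical, the labels file is assumed to have no header line, and the label names in the usage comment are placeholders rather than the ones used in the test data.

```python
from typing import Iterable, List


def label_names_in_order(lines: Iterable[str]) -> List[str]:
    """Return distinct population labels in order of first appearance.

    The position of a label in the returned list is taken to be its
    zero-based Inferred ancestry label id.
    """
    names: List[str] = []
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        _sample_id, label = line.strip().split("\t")[:2]
        if label not in seen:
            seen.add(label)
            names.append(label)
    return names


# Usage with the tutorial files (label names shown are placeholders):
# with open("./test/reference_panel_population_labels.txt") as f:
#     names = label_names_in_order(f)
# names[0] is then the population name for label id 0, and so on.
```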