--------------------------------- Introduction to pedigree problems --------------------------------- The problem is to identify genotype incompatibilities in pedigree data. A pedigree data consists in a list of individuals with their parental relationship. For each individual, there are two alleles at a given locus of interest, one coming from the father, the other one from the mother. Zero, one, or both allele(s) are known depending on genotyping data. For each nuclear family (parents and their children), Mendelien rules define the possible combination of alleles between parents and children: for every child, its father allele must be identical to one allele of its father and its mother allele must be identical to one allele of its mother. The first question is to determine if Mendelien rules are verified for all the families wrt the available genotyping data. If this is the case, then the goal is to detect for each individual all the possible alleles that are part of a solution. If the problem is inconsistent, then the goal is to find the minimum number of genotyping data to be removed so that the problem becomes satisfiable. More formally, the pedigree problem is defined by a sextuple (n,X,D,Cu,Cb,Ct). n is the number of possible different alleles in the pedigree (alleles take values in [1,n]). X is a set a decision variables. x_i in X encodes both alleles (father and mother alleles) associated to individual i. D is the initial domain, the same for all variables. The phase (from which parent comes each allele) is not taken into account in this model and D contains n*(n+1)/2 values "ab" generated by varying first allele a from 1 to n and second allele b from a to n. Cu is a set of |X| soft unary constraints. For each individual i, C_i in Cu associates a unit cost for values in x_i not compatible with the available genotyping data. If there is no genotyping data, then C_i is always satisfied. Cb is a set of hard binary constraints (|Cb| <= |X|). For each individual i with only one parent j which is present in the pedigree, then C_{ij} in Cb encodes Mendelien rules between i and j: first or second allele of x_i must be identical to first or second allele of x_j. Ct is a set of hard ternary constraints (|Ct| <= |X|). For each individual i with two parents j and k which are present in the pedigree, then C_{ijk} in Ct encodes Mendelien rules between i, j, and k: first allele of x_i must be identical to first or second allele of x_j or x_k, second allele of x_i must be identical to first or second allele of x_k or x_j, such that if first allele of x_i has selected x_j (resp. x_k) then second allele of x_i must select x_k (resp. x_j). Other models, e.g. including the phase or dedicated to typing error correction (using Most Probable Explanation), are described in: Mendelian error detection in complex pedigree using weighted constraint satisfaction techniques S. de Givry, Z. Vitezica and T. Schiex ICLP-05 workshop on Constraint Based Methods for Bioinformatics October 5, 2005, Sitges (Spain) http://www.inra.fr/mia/T/degivry/Schiex05a.pdf and Mendelian error detection in complex pedigree using weighted constraint satisfaction techniques M. Sanchez, S. de Givry, and T. Schiex Constraints, 2007, To appear. ------------------ Pedigree instances ------------------ Three instances were described in: An Optimal Algorithm for Automatic Genotype Elimination O'Connell, Weeks J. Hum Genet. 65:1733-1740, 1999 * Sobel (7 individuals, 1 loop, 3 alleles) * Saudi Arabia (parkinson filename, 37 individuals, 6 loops, 5 alleles) * Connell (12 individuals, 1 loop, 3 alleles, 1 critical genotype) Two others were described in: PedCheck: A Program for Identification of Genotype Incompatibilities in Linkage Analysis O'Connell, Weeks Am. J. Hum Genet. 63:259-266, 1998 * Pedigree from Wijsman and Guo [Ott,1993] (cancer filename, 49 individuals, 8 alleles, 1 critical genotype) * Eye-disease pedigree [Stringham&Boehnke,1996] (eye filename, 36 individuals, 6 alleles, 1 critical genotype) pedckX and pedclassA/B/C/D instances have been automatically generated by Zulma Vitezica (INRA, Toulouse, France) simulator called pedigreeSim. Usage: pedigreeSim number_of_generations number_of_founders number_of_males_in_founders Output: random pedigree in file ped.pre and list of introduced errors in file ped.errors. (see also generate.sh/generateind.sh/generatemale.sh scripts for generating several instances automatically) The berrichon1/(previously named sheep4n) instance (129516 individuals, 4 alleles) is a large pedigree of sheep provided by Zulma Vitezica and Isabelle Palhiere, INRA, Toulouse, France. It has been reduced in a smaller sub-instance by removing uninformative individuals (sheep4nr, 8990 individuals, 4 alleles). berrichon1nc/sheep4 (resp. sheep4r) is a sub-instance of berrichon1 (resp. sheep4nr) such that some simple nuclear-family typing errors have been removed by removing parental and typing data of corresponding erroneous children. The instances berrichon2nc/bcf2 and berrichon2/bcf3 are derived from berrichon1 and contain more genotypings. The instance charolais.pre, also provided by INRA and CTIG, has not been solved yet. Instances with suffix _r.pre have been reduced by removing non-informative individuals in .pre pedigrees (i.e. individuals with no genotyping data nor any of their descendants). All instances in .lp, .wcnf, .wcsp, .uai, .dzn format are based on reduced pedigrees. ------------------- Format descriptions ------------------- Simplified .pre format: ---------------------- column 1 pedigree number column 2 individual number column 3 father number column 4 mother number column 5 individual sex (1:male, 2:female) [extra column in normal .pre format: disease (1:no, 2:yes)] column 6 first allele at locus 1 column 7 second allele at locus 1 [extra column in normal .pre format: first allele at locus 2, and so on.] .cp format (only valid for atmost 9 different alleles): ------------------------------------------------------ Pedigree name (number of possible alleles, phase, problem upper-bound) ; (following description valid if no phase) For each individual, a decision variable (named "x" + individual number) and its initial allele domain ; Initial allele domains are all the same: each value is the concatenation of first and second allele values (e.g. 13 means first allele is 1 and second allele is 3) ; Symmetries are broken by the following convention: first(allele) <= second(allele) ; For each individual, a soft unary constraint associates a zero cost to allele combinations given by genotyping data and a unit cost for other combinations ; If genotyping data is unknown (special value: zero) then the soft unary constraint is always satisfied ; For each individual with known father and mother, there is a hard ternary constraint for imposing Mendelien rules on alleles ; low(individual,number of possible alleles) returns the first allele associated to individual decision variable ; high(individual,number of possible alleles) returns the second allele associated to individual decision variable. .wcsp format (readable by toolbar, automatically produced by pedigree.sh script): -------------------------------------------------------------------------------- See format section in http://carlit.toulouse.inra.fr/cgi-bin/awki.cgi/SoftCSP .ergo format (readable by toolbar and Irina Rish's software): ------------------------------------------------------------ (orgininaly defined by Noetic Systems http://www.noeticsystems.com/) See benchmark section in http://carlit.toulouse.inra.fr/cgi-bin/awki.cgi/SoftCSP ---------------------------- Software brief documentation ---------------------------- MendelSoft accepts data in pre format directly and is available at http://www.inra.fr/mia/T/MendelSoft/ Follow this link for the documentation. ---------------------------- OLD STUFF --------------------------------- A preliminary software was composed of a set of shell scripts and toolbar, our first WCSP solver. * pedigree.sh convert pedigree data in pre format into wcsp and ergo formats readable by toolbar (warning! you need gawk version 3.1 or later to be installed, see http://carlit.toulouse.inra.fr/cgi-bin/awki.cgi/CpWcspFormats) Usage: pedigree.sh data.pre Results: data.cp (intermediate format), data.wcsp and data.ergo * getmonocritics.sh find the list of mono-critic typings in an inconsistent pedigree data (such that removing any of these typings restore mendelian consistency) (prerequisite: run pedigree.sh before) Usage: getmonocritics.sh data.pre Results: list of erroneous typings * mpe.sh find the optimal mono-critic typing correction based on maximum likelihood computation (prerequisite: run pedigree.sh before) Usage: mpe.sh data.pre Results: list of possible typing corrections with log10likelihood ratios * toolbar generic weighted constraint satisfaction solver version 2.1a see also http://carlit.toulouse.inra.fr/cgi-bin/awki.cgi/ToolBarIntro (prerequisite: run pedigree.sh before) - prove inconsistency Usage: toolbar -u1 -p3 data.wcsp Results: "Lower bound: 1" if inconsistent, "Optimum: 0" if globally consistent - perform genotype elimination for a consistent pedigree Usage: toolbar -u1 -p3 -o6 -v data.wcsp Results: for each individual and each typing, returns "0" if consistent and "*" if not - find the minimum number of errors Usage: toolbar -p3 -v data.wcsp | awk -f solution2cp.awk data.cp - Results: optimum value (possibility to speed up the search by giving an initial upperbound, see -h and -u options) - propose an optimal correction based on a probabilistic model Usage: toolbar -p3 -v -f3 data.ergo | awk -f solution2cp.awk data.cp - ------------------------------------- Simon de Givry, INRA Toulouse, France