
Bases for Genomic Prediction

Andres Legarra

Daniela A.L. Lourenco

Zulma G. Vitezica

2024-02-21

1 Foreword by AL (the views expressed are his alone)

This is an incomplete attempt to write a comprehensive review of principles for genomic prediction. The framework is proudly parametric and tries to follow classical quantitative genetics and statistical theory as much as possible. It is incomplete: the wealth of papers being generated makes it impossible to follow all the literature. I apologize for the resulting self-centered bias.

My own knowledge of the topic owes much to dozens of colleagues with whom I have worked and discussed at length. I explicitly thank Ignacy Misztal, Ignacio Aguilar, and all my collaborators for so much joint work and discussion. Financing for these notes was provided by the INRA metaprogram SelGen. They were written in May 2014, during a visit to the University of Georgia (UGA), kindly hosted by Ignacy Misztal; during this visit we taught a course whose material (slides, exercises, and these notes) can be found at http://nce.ads.uga.edu/wiki . Updated versions of these notes can be found at http://genoweb.toulouse.inra.fr/~alegarra. I thank Guillermo Martinez-Boggio, Llibertat Tusell and Paul VanRaden for corrections and comments.

I deeply thank all those people that have produced and made available notes and courses, which have been so useful for me during the years.

I was not looking for you, and I saw you. (Yo no te buscaba y te vi.)

September 2014. A large number of mistakes and typos have been corrected.

February 2, 2015. More corrections and few suggestions by Llibertat Tusell and Paul VanRaden.

May 13 2016. Corrected error in the Bayesian example (thanks Jesús Piedrafita).

Oct 4 2016. Added posterior variance of marker effects from GBLUP.

August 2017. Added backsolving of GBLUP to SNPBLUP when there is tuning of G

November 2017. Slight correction on same topic

April 2018. Large additions for UGA course.

May 2022. Few typos, plus addition of Method LR, Reliabilities using SNP effects from (ss)GBLUP.

The cover is a drawing of a Blackbelly sheep in Barbados made by José Javier Legarra. Thank you, Tebe!

2 Main notation

\(\mathbf{X},\mathbf{b}\) Incidence matrix of fixed effects, and fixed effects

\(\mathbf{a}\) Marker effects

\(\mathbf{u}\) Polygenic or additive genetic effects

\(\sigma_{ai}^{2}\) Variance of the marker effect \(a_{i}\)

\(\sigma_{a0}^{2}\) Variance of marker effects if all had the same variance

\(\sigma_{u}^{2}\) Genetic variance

\(\sigma_{e}^{2}\) Residual variance

\(\mathbf{G}\) Genomic relationship matrix

\(p_{i}\) Allele frequency at marker \(i\)

\(\mathbf{A}\) Pedigree-based relationship matrix

3 A little bit of history

Based on Lourenco et al., 2017, BIF Conference.

Long before genomics found its way into livestock breeding, most of the excitement pertaining to research into livestock improvement via selection involved developments in the BLUP mixed model equations, methods to construct the inverse of the pedigree relationship matrix recursively (C. R. Henderson 1976; Quaas 1976), parameter estimation, and the development of new, measurable traits of economic importance. In particular, for several decades (the 1970’s through the early 2000’s), many resources were invested in finding the most useful evaluation model for various traits. Since the 1970’s, the use of pedigree and phenotypic information has been the major contributing factor to the large amount of genetic progress in the livestock industry.

During the late 1970’s and early 1980’s, geneticists developed techniques that allowed the investigation of DNA, and they discovered several polymorphic markers in the genome. Soller and Beckmann (1983) described the possible uses of the newly discovered polymorphisms, and surprisingly, their vision of using markers was not much different from how DNA is used today in the genetic improvement of livestock. They hypothesized that markers would be beneficial in constructing more precise genetic relationships, followed by parentage determination and the identification of quantitative trait loci (QTL). The high cost of genotyping animals for such markers probably prevented the early widespread use of this technology. However, valuable information came along with the first draft of the Human Genome Project in 2001 (Group 2001): the majority of genome sequence variation can be attributed to single nucleotide polymorphisms (SNP).

After all, what are SNPs? The genome is composed of 4 different nucleotides (A, C, T, and G). If you compare the DNA sequences of 2 individuals, there may be some positions where the nucleotides differ. The reality is that SNPs have become the bread-and-butter of DNA sequence variation (Stoneking 2001) and they are now an important tool to determine the genetic potential of livestock. Even though several other types of DNA markers have been discovered (e.g., microsatellites, RFLP, AFLP), SNPs have become the main marker used to detect variation in the DNA. Why is this so? An important reason is that SNPs are abundant, as they are found throughout the entire genome (Schork, Fallin, and Lanchbury 2000). There are about 3 billion nucleotides in the bovine genome and over 30 million SNPs; that is, roughly 1 in every 100 nucleotides is a SNP. Another reason is their location in the DNA: they are found in introns, exons, promoters, enhancers, and intergenic regions. In addition, SNPs are now cheap and easy to genotype in an automated, high-throughput manner because they are biallelic.

One of the benefits of marker genotyping is the detection of genes that affect traits of importance. The main idea of using SNPs in this task is that a SNP found to be associated with a trait phenotype is a proxy for a nearby gene or causative variant (i.e., a SNP that directly affects the trait). As many SNPs are present in the genome, the likelihood of having at least 1 SNP linked to a causative variant greatly increases, augmenting the chance of finding genes that actually contribute to genetic variation for the trait. This fact contributed to much initial excitement as labs and companies sought to develop genetic tests or profiles of DNA that were associated with genetic differences between animals for important traits. Suddenly, marker-assisted selection (MAS) became popular. The promise of MAS was that since the test or profile appeared to contain genes that directly affect the trait, potentially great genetic improvement could be realized by selecting parents that had the desired marker profile. It is not hard to see this would work very well for traits affected by one or a couple of genes. In fact, several genes were identified in cattle, including the myostatin gene located on chromosome 2. When 2 copies of the loss-of-function mutation are present, excessive muscle hypertrophy is observed in some breeds, including Belgian Blue, Charolais, and Piedmontese (Andersson 2001). Another example, shown to have a small but appreciable effect on beef tenderness, pertains to the Calpain and Calpastatin genes (Page et al. 2002), for which a genetic test was commercialized by Neogen Genomics (GeneSeek, Lincoln, NE) and Zoetis (Kalamazoo, MI). It is important to notice that all those achievements were based on a few SNPs or microsatellites, because genotyping costs were still high.

Although there were a few applications in cattle breeding, MAS based on a few markers was not contributing appreciably to livestock improvement, simply because most traits of interest are quantitative and complex, meaning phenotypes are determined by thousands of genes with small effects and influenced by environmental factors. This goes back to the infinitesimal model assumed by Fisher (1918), where phenotypic variation is explained by a large number of Mendelian factors with additive effects. Some lessons were certainly learned from the initial stab at MAS: some important genes or gene regions (quantitative trait loci or QTL) were detected; however, the same QTL were not always observed in replicated studies or in other populations, meaning most of them had small effects on the traits (T. Meuwissen, Hayes, and Goddard 2016). In addition, the number of QTL associated with a phenotype is rather subjective and depends on the threshold effect size used for identifying QTL (Andersson 2001). Simply put, it appears there are only a few genes that contribute more than 1% of the genetic variation observed between animals for any given polygenic trait.

The initial allure of MAS led to a massive redirecting of grant funds to this type of research, greatly contributing to the current shortage of qualified quantitative geneticists in animal breeding (Ignacy Misztal n.d.). Despite some of the initial setbacks of MAS, in 2001 some researchers envisioned that genomic information could still help animal breeders generate more accurate breeding values, if a dense SNP assay covering the entire genome became available. Extending the idea of incorporating marker information into BLUP (using genotypes, phenotypes and pedigree information) introduced by R. L. Fernando and Grossman (1989), Meuwissen et al. (2001) proposed some methods for what is now termed genome-wide selection or genomic selection (GS). This paper used simulated data to show that the accuracy of selection was doubled using genomic selection compared to using only phenotypes and pedigree information. With the promise of large accuracy gains, this paper generated enormous excitement in the scientific community. Some conclusions from this study included: 1) using SNP information can help to increase genetic gain and to reduce the generation interval; 2) the biggest advantage of genomic selection would be for traits with low heritability; 3) animals can be selected early in life, prior to performance or progeny testing. With all of this potential, genomic selection was an easy sell.

However, it took about 8 years from the publication of the Meuwissen et al. (2001) paper until the dense SNP assay required for genomic selection became available for cattle. Researchers from USDA, Illumina, the University of Missouri, the University of Maryland, and the University of Alberta developed a SNP genotyping assay allowing the genotyping of 54,001 SNP in the bovine genome (Illumina Bovine50k v1; Illumina, San Diego, CA). The initial idea of this research was to use the SNP assay or chip for mapping disease genes and QTLs linked to various traits in cattle (Matukumalli et al. 2009). In 2009, a report on the first fully sequenced bovine genome (Consortium et al. 2009) was published as the output of a project that cost over $50 million and involved about 300 researchers. With the cattle sequence known, it was possible to estimate the number of genes in the bovine genome: somewhere around 22,000. Armed with the tools to generate genomic information, GS became a reality.

Among all livestock industries in the USA, the dairy industry was the first to use genomic selection. More than 30,000 Holstein cattle had been genotyped for more than 40k SNP by the end of 2009 (https://www.uscdcb.com/Genotype/cur_density.html). In January 2009, researchers from AGIL-USDA released the first official genomic evaluation for Holstein and Jersey. Also in 2009, Angus Genetics Inc. started to run genomic evaluations, but with substantially fewer genotypes, which was also true for other livestock species. After the first validation exercises, the real gains in accuracy were far less than those promised in (T. H. E. Meuwissen, Hayes, and Goddard 2001). This brought some uncertainty about the usefulness of GS, which was later calmed by the understanding that more animals should be genotyped to reap the full benefits of GS. VanRaden et al. (2009) showed an increase in accuracy of 20 points when using 3,576 genotyped bulls, as opposed to 6 points when using 1,151 bulls. Now, in 2017, Holstein USA has almost 1.9 million genotyped animals and the American Angus Association has more than 400,000.

When GS was first implemented for dairy breeding purposes, all the excitement was around one specific Holstein bull nicknamed Freddie (Badger-Bluff Fanny Freddie), which had no daughters with milking records in 2009 but was found to be the best young genotyped bull in the world (VanRaden, personal communication). In 2012 when his daughters started producing milk, his superiority was finally confirmed. Freddie’s story is an example of what can be achieved with GS, as an animal with high genetic merit was identified earlier in life with greater accuracy. With the release of genomic estimated breeding values (GEBV), the race to genotype more animals started.

The availability of more genotyped cattle drove the development of new methods to incorporate genomic information into national cattle evaluations. The first method was called multistep and, as the name implies, required multiple analyses to obtain the final GEBVs. Distinct training and validation populations were needed to develop molecular breeding values (MBV) or direct genomic values (DGV), which were blended with traditional EBVs or included as correlated traits (Kachman et al. 2013). This multistep model was the first one implemented for genomic selection in the USA. Several studies examining the application of multistep methods in beef cattle evaluation have been published (Saatchi et al. 2011; Snelling et al. 2011). The main advantage of this approach is that the traditional BLUP evaluation is kept unchanged and genomic selection can be carried out with additional analyses. However, this method has some disadvantages: a) MBV are only generated for simple models (i.e., single-trait, non-maternal models), which is not the reality of genetic evaluations; b) it requires pseudo-phenotypes (EBVs adjusted for parent average and accuracy); c) pseudo-phenotypes rely on accuracies obtained via approximate algorithms, which may generate low-quality output; d) only genotyped animals are included in the model; e) MBV may contain part of the parent average, which leads to double counting of information.

As only a fraction of livestock are genotyped, Misztal et al. (I. Misztal, Legarra, and Aguilar 2009) proposed a method that combines phenotypes, pedigree, and genotypes in a single evaluation. This method is called single-step genomic BLUP (ssGBLUP) and involves altering the relationships between animals based on the similarity of their genotypes. As an example, full sibs share an average of 50% of their DNA, but in practice this may range from 20% to 70% (D. A. Lourenco et al. 2015). ssGBLUP has some advantages over multistep methods. It can be used with multi-trait and maternal-effect models, it avoids double counting of phenotypic and pedigree information, it ensures proper weighting of all sources of information, and it can be used with both small and large populations and with any number of genotyped animals. Overall, greater accuracies and less inflation can be expected with ssGBLUP than with multistep methods. Not long after the implementation of GS, single-step was first applied to a dairy population with more than 6,000 genotyped animals (I. Aguilar et al. 2010; O. F. Christensen and Lund 2010).

An early application of ssGBLUP in beef cattle used simulated data with 1500 genotyped animals in an evaluation for weaning weight with direct and maternal effects (D. Lourenco et al. 2013). Although a small number of genotyped animals was used, gains in accuracy were observed for both direct and maternal weaning weight. Next, ssGBLUP was applied to a real breed-association data set (D. a. L. Lourenco et al. 2015). This study showed a comprehensive genomic evaluation for nearly 52,000 genotyped Angus cattle, with a considerable gain in accuracy in predicting future performance for young genotyped animals; the gain was on average 4.5 points over traditional evaluations.

4 Quick look at SNP Data

The most abundant polymorphisms at the DNA level are SNPs: Single Nucleotide Polymorphisms. By the art of biochemistry and the joint effort of industry and academia, it is now possible to read the same set of SNPs across several individuals massively (many of them at a time), accurately (the genotype read is the actual genotype) and economically (the cost is relatively low); the technology is commonly called SNP chips or SNP genotyping. For more information, you may read, for instance, https://www.illumina.com/techniques/popular-applications/genotyping.html .

Triallelic SNPs exist in nature, but they are not used in SNP chips. Thus, the possible allele pairs for SNP loci are the pairwise combinations of (A,C,G,T): A/C, A/G, A/T, C/G, C/T, G/T.

4.1 From crude SNP file to usable genotype file

I (A.L.) do not have much experience, but I have some, in dealing with crude SNP data. These are, for practical purposes such as national genomic evaluations, handled by experienced teams and read and stored in databases; for instance, a description of such a process is in (Wiggans et al. 2010; Groeneveld and Lichtenberg 2016). However, it is good to be exposed to the crude output of genotyping. This is an excerpt from a real analysis; the name of this file was ..._Custom-FinalReport.txt:

[Header]
GSGT Version    1.9.4
Processing Date 3/16/2012 9:11 AM
Content         OvineSNP50_B.bpm
Num SNPs        54241
Total SNPs      54241
Num Samples     36
Total Samples   36
[Data]
Sample ID       Sample Name     SNP Name        Allele1 - Top   Allele2 - Top   GC Score
ES140000270478  PLACA_CIC_12_96 250506CS3900065000002_1238.1    G       G       0.8932
ES140000270478  PLACA_CIC_12_96 250506CS3900140500001_312.1     A       G       0.7341
ES140000270478  PLACA_CIC_12_96 250506CS3900176800001_906.1     A       G       0.7532
ES140000270478  PLACA_CIC_12_96 250506CS3900211600001_1041.1    A       A       0.9674
ES140000270478  PLACA_CIC_12_96 250506CS3900218700001_1294.1    G       G       0.8178
ES140000270478  PLACA_CIC_12_96 250506CS3900283200001_442.1     C       C       0.6684
ES140000270478  PLACA_CIC_12_96 250506CS3900371000001_1255.1    G       G       0.4565
ES140000270478  PLACA_CIC_12_96 250506CS3900386000001_696.1     A       A       0.4258
ES140000270478  PLACA_CIC_12_96 250506CS3900414400001_1178.1    G       G       0.8690
ES140000270478  PLACA_CIC_12_96 250506CS3900435700001_1658.1    A       A       0.5153
ES140000270478  PLACA_CIC_12_96 250506CS3900464100001_519.1     A       G       0.8116
ES140000270478  PLACA_CIC_12_96 250506CS3900487100001_1521.1    A       G       0.7448
ES140000270478  PLACA_CIC_12_96 250506CS3900539000001_471.1     G       G       0.5248
ES140000270478  PLACA_CIC_12_96 250506CS3901012300001_913.1     A       A       0.7413
ES140000270478  PLACA_CIC_12_96 250506CS3901300500001_1084.1    G       G       0.7990
ES140000270478  PLACA_CIC_12_96 CL635241_413.1  A       A       0.8176
ES140000270478  PLACA_CIC_12_96 CL635750_128.1  A       G       0.7978
ES140000270478  PLACA_CIC_12_96 CL635944_160.1  A       G       0.7283

This data contains genotypes for one animal, ES140000270478, for the SNPs listed in SNP Name. The columns Allele1 and Allele2 contain the readings in nucleotide form (Adenine, Guanine, Cytosine and Thymine – A, G, C, T). For instance, at SNP Name 250506CS3900065000002_1238.1 this animal is homozygous G/G, but at CL635750_128.1 the animal is heterozygous A/G. The Allele1/Allele2 notation is, for our purposes, arbitrary: we do not know which one came from the sire and which one came from the dam.

Now, you can see that the same animal ES140000270478 is repeated over and over; there is one line per marker. The file consists of a header and one line per individual and per marker. At some point we arrive at the next animal:

ES140000270478 PLACA_CIC_12_96 s76040.1 G G 0.6173
ES140000270478 PLACA_CIC_12_96 s76043.1 A A 0.7965
ES150010016299 PLACA_CIC_10_02 250506CS3900065000002_1238.1 G G 0.8932
ES150010016299 PLACA_CIC_10_02 250506CS3900140500001_312.1 A G 0.7341
ES150010016299 PLACA_CIC_10_02 250506CS3900176800001_906.1 A G 0.7668

And so on. There is thus a lot of redundancy here.

Another type of file has the final_report format:

[Header]
GSGT Version    1.9.4
Processing Date 01/01/2018 10:11 AM
Content     BovineSNP50_v2_C.bpm
Num SNPs    54609
Total SNPs  54609
Num Samples 1
Total Samples   1
[Data]
SNP Name    Sample ID   Allele1 - Forward   Allele2 - Forward   Allele1 - Top   Allele2 - Top   Allele1 - AB    Allele2 - AB    GC Score    X   Y
ARS-BFGL-BAC-10172  USA201811   G   G   G   G   B   B   0.9506  0.012   1.036
ARS-BFGL-BAC-1020   USA201811   G   G   G   G   B   B   0.9673  0.005   0.652
ARS-BFGL-BAC-10245  USA201811   C   C   G   G   B   B   0.7579  0.092   1.417
ARS-BFGL-BAC-10345  USA201811   A   A   A   A   A   A   0.9276  1.143   0.008
ARS-BFGL-BAC-10365  USA201811   G   G   C   C   B   B   0.5335  0.004   0.862
ARS-BFGL-BAC-10375  USA201811   A   G   A   G   A   B   0.9567  0.478   0.581
ARS-BFGL-BAC-10591  USA201811   A   G   A   G   A   B   0.9003  0.386   0.473
ARS-BFGL-BAC-10867  USA201811   G   G   C   C   A   A   0.9434  0.776   0.004
ARS-BFGL-BAC-10919  USA201811   A   A   A   A   A   A   0.8526  1.232   0.036
ARS-BFGL-BAC-10951  USA201811   T   T   A   A   A   A   0.5140  0.539   0.017
ARS-BFGL-BAC-10952  USA201811   A   A   A   A   A   A   0.9512  0.987   0.030
ARS-BFGL-BAC-10960  USA201811   G   G   G   G   B   B   0.9528  0.018   0.826
ARS-BFGL-BAC-10972  USA201811   G   C   C   G   A   B   0.8759  0.917   0.743
ARS-BFGL-BAC-10975  USA201811   A   G   A   G   A   B   0.8142  0.979   0.739
ARS-BFGL-BAC-10986  USA201811   G   G   C   C   B   B   0.9309  0.055   0.731
ARS-BFGL-BAC-10993  USA201811   C   C   G   G   B   B   0.9014  0.023   1.094
ARS-BFGL-BAC-11000  USA201811   T   T   A   A   A   A   0.9686  0.561   0.013
ARS-BFGL-BAC-11003  USA201811   T   T   A   A   A   A   0.9215  1.171   0.040
ARS-BFGL-BAC-11007  USA201811   T   C   A   G   A   B   0.9454  0.884   0.675
ARS-BFGL-BAC-11025  USA201811   G   G   C   C   B   B   0.9082  0.015   0.740
ARS-BFGL-BAC-11028  USA201811   A   G   A   G   A   B   0.9678  0.182   0.288
ARS-BFGL-BAC-11034  USA201811   T   C   A   G   A   B   0.9509  0.566   0.592
ARS-BFGL-BAC-11039  USA201811   C   C   G   G   B   B   0.9658  0.000   0.889
ARS-BFGL-BAC-11042  USA201811   A   G   A   G   A   B   0.8506  0.947   0.786
ARS-BFGL-BAC-11044  USA201811   T   C   A   G   A   B   0.9654  0.726   0.689
ARS-BFGL-BAC-11047  USA201811   T   T   A   A   A   A   0.9465  0.973   0.015

This format is apparently more confusing, but it is explained here: https://www.illumina.com/documents/products/technotes/technote_topbot.pdf . In short, what we need to look at is the A/B columns. For one marker, A and B may “mean” A and T, whereas at another locus they may “mean” T and A. However, the A/B notation is less ambiguous (or more accurate) than A/C/G/T due to the way the chemistry works. Note that the “A” in the A/C/G/T system is not the same as the “A” in the A/B system.

Anyway, there is a need to gather this information into a more condensed format. This format is usually comprised of:

    ES1400NAB40571 G G G G A A A C . . A G
    ES1400NAB40573 G G G G G G A C G G A G
    ES1400NAB40574 A G G G A G A C G G A A
    ES1400NAB40159 G G G G A G A C G G A A
    ES1400NAB40528 A G A G A G C C A G A A
    ES1500VI492705 G G A G G G A C G G A G
    ES1500SSA40533 A G G G A G C C G G A A

The . . implies that there is no reading for this marker and individual: this is a missing genotype. SNP chips have few missing genotypes, but other technologies like GBS (Genotyping By Sequencing) have very large amounts of missing genotypes. Imputation “fills in” those gaps; it will be mentioned later. Or, it could be (just making it up):

ES1400NAB40571 B B A A B B A B . . A A
...

We can compact it even further, noting that (1) SNPs are biallelic, (2) the paternal/maternal origin is unknown, and (3) we can represent the genotype as an integer counting the number of copies (gene content: this wording will be used over and over) of a reference allele of the two that are polymorphic at the SNP marker. This is also known as allele coding. For most practical purposes, which one is the reference allele is irrelevant. For instance, assume that for ES1400NAB40571 the reference allele is B for all markers. These are the integer codes:

ES1400NAB40571 2 0 2 1 5 0

Or, in a more compact way (this is what in these notes we call the “UGA format”),

ES1400NAB40571 202150

It can be seen that the correspondences are:

code genotype
0 AA
1 AB or BA
2 BB
5 missing
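As a toy illustration, this A/B-to-integer conversion can be sketched in Python (the function name is ours, not part of any standard tool; `.` denotes a missing reading):

```python
def code_ab(genotype_pairs):
    """Convert a list of (allele1, allele2) pairs in Illumina A/B notation
    to gene content of the B allele: AA=0, AB/BA=1, BB=2, missing=5."""
    codes = []
    for a1, a2 in genotype_pairs:
        if a1 == "." or a2 == ".":
            codes.append(5)  # missing genotype
        else:
            # count copies of the B allele (booleans sum as 0/1)
            codes.append((a1 == "B") + (a2 == "B"))
    return "".join(str(c) for c in codes)

# the A/B record for ES1400NAB40571 shown above
pairs = [("B", "B"), ("A", "A"), ("B", "B"), ("A", "B"), (".", "."), ("A", "A")]
print(code_ab(pairs))  # 202150
```

The output matches the UGA-format line for this animal given above.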

The reference allele can vary across loci. For instance, consider the same animal

ES1400NAB40571 G G G G A A A C . . A G

And consider that the reference alleles for each of the 6 markers are (G,G,A,C,G,A). Using these reference alleles would give

ES1400NAB40571 222151

This is different from the coding above. Actually, the table of correspondences would be:

Code Marker 1 Marker 2 Marker 3 Marker 4 Marker 5 Marker 6
0 AA AA GG CC AA GG
1 AG or GA AG or GA AG or GA AC or CA AG or GA AG or GA
2 GG GG AA AA GG AA
5 missing missing missing missing missing missing

Note that here we put “Marker 1”, “Marker 2”, etc, but actually the names are more complex, for instance,

250506CS3900539000001_471.
250506CS3901012300001_913.1     
250506CS3901300500001_1084.1    
CL635241_413.1  
CL635750_128.1  
CL635944_160.1

For this reason, it is essential to keep track of the names of the markers that we use.

For most purposes the coding is irrelevant, but it needs to be coherent for every batch of new animals. This is why it is mandatory either to stick to one of the alleles (say B) in Illumina’s A/B system, or (better) to store the whole database with readings in both the A/C/G/T and A/B formats. It is also mandatory to keep track of the names of the SNP markers in the files. Joining files of integer codes with no associated file of SNP names is dangerous.

There is software available to convert from the long Illumina output format to the compact format; one example is illumina2preGS. Alternatively, a self-written script or program can be used.

Note that the Plink format is different from the UGA format described here, but there are converters between formats, or you may program your own.

4.2 Basic checking of marker information

A cool feature of the integer format is that some things are very easy to compute.

4.2.1 Call rates

The call rate is the proportion of observed (i.e., non-missing) genotypes.

Measuring call rate reduces to counting the number of missing entries (either “. .” or “5”) in the genotype file, per row or per column. Animals that have low call rate (i.e. too many markers not genotyped) are eliminated. This is often due to bad conservation of the DNA. Markers that have low call rate (i.e. they have not been read for many animals) are also eliminated. Typically, this is due to poor biochemistry.

Typical thresholds for quality control of call rates are 90% or 95%. Below this level, either the marker or the individual is discarded.
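A minimal sketch of the call-rate computation on UGA-format strings, with a 90% filter as above (function and variable names are made up):

```python
def call_rates(genotypes):
    """Per-animal and per-marker call rates from UGA-format strings
    (one string of 0/1/2/5 codes per animal; 5 = missing)."""
    n_animals = len(genotypes)
    n_snp = len(genotypes[0])
    # per animal: fraction of markers actually read
    animal_cr = [1 - g.count("5") / n_snp for g in genotypes]
    # per marker: fraction of animals actually read
    marker_cr = [sum(g[j] != "5" for g in genotypes) / n_animals
                 for j in range(n_snp)]
    return animal_cr, marker_cr

geno = ["202150", "202220", "122151"]
acr, mcr = call_rates(geno)
# animals 1 and 3 each miss 1 of 6 markers; marker 5 is missing in 2 of 3 animals
keep_animals = [i for i, cr in enumerate(acr) if cr >= 0.90]
print(keep_animals)  # [1]
```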

4.2.2 Allele frequencies and Minor Allele Frequencies (MAF)

The allele frequency \(p\) is simply the frequency of the reference allele. For instance, consider

ES1400NAB40571 G G
ES1400NAB40573 G G
ES1400NAB40574 A G
ES1400NAB40159 G G
ES1400NAB40528 A G
ES1500VI492705 G G
ES1500SSA40533 A A

If the reference allele is G, we have 10 G against 4 A: \(p = \frac{10}{14} \approx 0.71\), and the frequency of allele A is \(q = 1 - p \approx 0.29\). Conveniently, this is very easy to compute from the UGA format:

ES1400NAB40571 2
ES1400NAB40573 2
ES1400NAB40574 1
ES1400NAB40159 2
ES1400NAB40528 1
ES1500VI492705 2
ES1500SSA40533 0

From this format, there are (quite obviously) 2 “G”s for each “2”, 1 “G” for each “1” and 0 “G”s for each “0”. So, quite simply, and skipping the columns with missing information:

\[p = \frac{\text{sum of the column}}{2 \times \text{number of lines}}\]

For instance, this awk script computes allele frequencies:

#!/bin/awk -f
#
# This script computes allele frequencies from a marker file in UGA format
# (AA=0, Aa=1, aA=1, aa=2; no missing genotypes)
#
{
    nsnp=length($2)
    split($2,aux,"")
    for (i=1; i<=nsnp; i++){ cnt[i]=cnt[i]+aux[i] }
}
END {
    for(i=1; i<=nsnp; i++){ print(cnt[i]/(2*NR)) }
}

The Minor Allele Frequency (MAF) is the lower of the two allele frequencies \(p\) and \(q = 1 - p\); in Fortran terms, maf=minval((/p,q/)) . It is used as a measure of the informativeness of the marker. A marker that has \(p = 1\) is said to be monomorphic and gives no information, as all individuals are identical for this marker. Therefore, we may ignore it. Accordingly, we may also ignore markers for which almost all individuals have the same genotype; for instance, if \(p = 0.9999\). Where do we put the limit? A rule of thumb for genomic prediction with SNP chips is to remove markers with MAF<5%, or MAF<1%. In practice, this does not change the results of prediction much. But if the objective is to investigate rare variants, then we should not edit markers by MAF.
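In the same spirit as the awk script above, the allele frequency and the MAF edit can be sketched in Python (a toy illustration; names are ours, and the 5% threshold is the rule of thumb just mentioned):

```python
def freq_and_maf(genotypes):
    """Allele frequency p of the reference allele and MAF per marker,
    from UGA-format strings, skipping missing codes (5)."""
    n_snp = len(genotypes[0])
    p, maf = [], []
    for j in range(n_snp):
        col = [int(g[j]) for g in genotypes if g[j] != "5"]
        pj = sum(col) / (2 * len(col))  # p = column sum / (2 x n lines)
        p.append(pj)
        maf.append(min(pj, 1 - pj))     # MAF is the lower of p and q
    return p, maf

# the 7-animal, 1-marker example above: codes 2,2,1,2,1,2,0
p, maf = freq_and_maf(["2", "2", "1", "2", "1", "2", "0"])
print(round(p[0], 2), round(maf[0], 2))  # 0.71 0.29

# typical edit: keep only markers with MAF >= 5%
keep = [j for j, m in enumerate(maf) if m >= 0.05]
```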

4.2.3 Hardy-Weinberg equilibrium

In an unselected population, the distribution of genotypes is expected to follow Hardy-Weinberg proportions. In practice, most populations are selected, so Hardy-Weinberg equilibrium (HWE) does not always hold. We can check proportions of observed and expected counts of genotypes. If \(n\) is the total number of animals and \(n_{0},n_{1},n_{2}\) are the counts of each genotype:

Genotype 0 1 2
Observed \(n_{0}\) \(n_{1}\) \(n_{2}\)
Expected \(nq^{2}\) \(2npq\) \(np^{2}\)

From this table, it is possible to build a statistical test of the hypothesis that the data are under HWE. The statistic is

\[\chi^{2} = \sum_{i = 0}^{2}\frac{\left( \text{Observed}_{i} - \text{Expected}_{i} \right)^{2}}{\text{Expected}_{i}}\]

where \(\text{Expected}(0:2) = n\left( q^{2},\ 2pq,\ p^{2} \right)\). Another way of getting the same statistic directly from the counts (Emigh 1980) is:

\[\chi^{2} = 16n\frac{\left( n_{0}n_{2} - \frac{n_{1}^{2}}{4} \right)^{2}}{\left( 2n_{0} + n_{1} \right)^{2}\left( n_{1} + 2n_{2} \right)^{2}}\]

This statistic follows a \(\chi^{2}\) distribution with 1 degree of freedom (Emigh 1980). For instance, in the above example with 7 animals:

\[\chi^{2} = \frac{\left( 1 - 7 \times {0.29}^{2} \right)^{2}}{7 \times {0.29}^{2}} + \frac{\left( 2 - 7 \times 2 \times 0.29 \times 0.71 \right)^{2}}{7 \times 2 \times 0.29 \times 0.71} + \frac{\left( 4 - 7 \times {0.71}^{2} \right)^{2}}{7 \times {0.71}^{2}} = 0.63\]

which has a non-significant p-value of 0.43, computed in R as:

pchisq(0.63,1,0,lower.tail=FALSE)

You must be very careful if you use the HWE statistic for quality control. In practice, HWE holds approximately but never exactly. For this reason, with large data sets, the hypothesis is rejected. A more sensible approach is to reject the marker if the number of observed heterozygotes deviates too much from the expectation. In other words, a marker may be rejected if

\[\left| \frac{n_{1}}{n} - 2pq \right| > t\]

for some threshold \(t\). The value used by default in the BLUPF90 suite of programs is 0.15, following (Wiggans et al. 2009).
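Both the Emigh (1980) statistic and the heterozygote-deviation check can be sketched in Python (a toy illustration; helper names are ours, and t = 0.15 is the BLUPF90 default mentioned above):

```python
def hwe_chi2(n0, n1, n2):
    """Chi-square HWE statistic from genotype counts (Emigh 1980),
    1 degree of freedom."""
    n = n0 + n1 + n2
    num = 16 * n * (n0 * n2 - n1**2 / 4) ** 2
    den = (2 * n0 + n1) ** 2 * (n1 + 2 * n2) ** 2
    return num / den

def het_excess(n0, n1, n2, t=0.15):
    """True if observed heterozygosity deviates from 2pq by more than t."""
    n = n0 + n1 + n2
    p = (n1 + 2 * n2) / (2 * n)  # frequency of the reference allele
    return abs(n1 / n - 2 * p * (1 - p)) > t

# the 7-animal example above: counts (1, 2, 4)
print(round(hwe_chi2(1, 2, 4), 2))  # 0.63
print(het_excess(1, 2, 4))          # False: marker passes the check
```

The count-based formula reproduces the 0.63 obtained above from observed and expected counts.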

4.2.4 Genotypic frequencies in crosses

HWE does not hold in crosses, for instance in F1 crosses, so it should not be checked. We can, however, present what the genotypic frequencies in F1 crosses should be. If allele frequencies in breed A and breed B are \(p_{A}\) and \(p_{B}\), then \(\text{Expected}(0:2) = n\left( q_{A}q_{B},\ p_{A}q_{B} + p_{B}q_{A},\ p_{A}p_{B} \right)\). This may be useful to check your data.
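A small sketch of these expected F1 counts (the allele frequencies below are made up):

```python
def f1_expected(n, pA, pB):
    """Expected counts of genotypes 0, 1, 2 among n F1 animals from a cross
    between breeds with reference-allele frequencies pA and pB."""
    qA, qB = 1 - pA, 1 - pB
    return (n * qA * qB,                 # genotype 0: both alleles non-reference
            n * (pA * qB + pB * qA),     # genotype 1: one allele from each breed
            n * pA * pB)                 # genotype 2: both reference alleles

print(tuple(round(x, 1) for x in f1_expected(100, 0.8, 0.3)))  # (14.0, 62.0, 24.0)
```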

4.2.5 Sex chromosomes and unmapped markers

The sex chromosomes (X and Y in mammals, Z and W in birds) present some complexities for genomic analysis. Females in mammals carry two alleles at sex chromosomes, but males carry two alleles only in the pseudo-autosomal part (shared by chromosome Y and its counterpart on X). Therefore, these chromosomes are almost systematically eliminated from the analysis. The literature presents methods to deal with sex-linked inheritance, both in the classical pedigree way (R. L. Fernando and Grossman 1990) and in the genomic way (Su et al. 2014).

Maps (physical positions of markers on chromosomes) are typically constructed by consortia (e.g. http://bovinegenome.org , http://www.sheephapmap.org ). It happens that markers may be genotyped while their exact position is unknown; these markers are assigned to “chromosome 0”. They are typically discarded – they are very few, and not knowing the position makes some analyses difficult.

4.2.6 Mendelian conflicts and assigning parents

Some genotypes might be incompatible with the declared pedigree. The most typical cases are (1) conflicting genotypes between one parent and its offspring: a father “AA” cannot sire an offspring “aa”, and (2) conflicting genotypes between both parents and the offspring: “AA” x “AA” cannot produce “Aa”. If such an event is found comparing genotype and pedigree data, it may be a single genotyping error – however, if several of them are found for a couple or trio of individuals, either there is a problem in the pedigree or a misidentification of the sample.
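A toy sketch of the parent-offspring check for case (1), with gene contents coded 0/1/2 (this is only an illustration of the idea, not the algorithm of any particular program; the function name is invented):

```r
# count parent-offspring Mendelian conflicts: opposite homozygotes (0 vs 2)
# gene contents coded as 0, 1, 2 copies of the reference allele; NA = missing
mendel_conflicts <- function(parent, offspring) {
  sum(abs(parent - offspring) == 2, na.rm = TRUE)
}
sire      <- c(2, 0, 1, 2, 0)   # invented genotypes at 5 markers
offspring <- c(0, 0, 2, 1, 1)
mendel_conflicts(sire, offspring)  # 1 conflict, at the first marker (2 vs 0)
```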

One of the possibilities, if there is a conflict or if the sire/dam is not in the pedigree, is to find the sire or dam of an individual based on the observed genotypes (B. Hayes 2011). One such program is seekparentf90.

4.2.7 Duplicate genotypes

Unless there are clones or monozygotic twins, we do not expect identical genotypes – so a very high concordance between the genotypes of two animals is suspicious. In most cases this is due to mislabeling: two DNA samples from the same animal received two different name tags.
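A minimal sketch of such a concordance check (the genotypes and the flagging threshold are invented for illustration):

```r
# proportion of identical genotype calls (0/1/2) between two samples
concordance <- function(g1, g2) mean(g1 == g2, na.rm = TRUE)
g1 <- c(0, 1, 2, 2, 1, 0, 1, 2, 0, 1)   # invented genotypes at 10 markers
g2 <- c(0, 1, 2, 2, 1, 0, 1, 2, 0, 2)   # differs only at the last marker
concordance(g1, g2)         # 0.9
concordance(g1, g2) > 0.99  # flag as a possible duplicate if TRUE
```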

5 A quick tour of Linkage Disequilibrium

The aim of this section is not really to make a full description, which is beyond the scope of these notes, but to give a few concepts that might be of relevance for practitioners.

In a genome there are many loci and loci have alleles. In a population, there is a certain distribution of alleles within a locus but also across loci. This distribution can be described by a regular table. For instance, assume two biallelic loci and that we have 5 individuals, and therefore 10 gametes in our population:

\[\{ AB,\ AB,\ ab,\ aB,\ ab,\ ab,\ Ab,\ AB,\ Ab,\ AB\}\]

You may call this: haplotypes, diplotypes, or genotypes of the gametes.

Consider first the allelic frequencies within loci: for the first locus, \(p_1 = \text{freq}(A) = 0.6\); for the second locus, \(p_2 = \text{freq}(B) = 0.5\).

However, to consider the joint frequency at the two loci, we need a frequency table of these diplotypes, as follows:

Example of two loci in Linkage disequilibrium
freqs A a
B 0.4 0.1
b 0.2 0.3

The eye sees that allele “A” comes most often associated with “B”. But is this relevant? Does the presence of “A” give any clue about the presence of “B”?

Linkage equilibrium is a common assumption. In linkage equilibrium, alleles across loci are distributed at random. For instance, \(\text{freq}\left( \text{AB} \right) = \text{freq}\left( A \right) \times \text{freq}\left( B \right) = 0.30\). If this were the case, the table would be as follows:

Frequency table if the two loci were in Linkage equilibrium
A a
B 0.3 0.2
b 0.3 0.2

Linkage disequilibrium (LD) is the non-random association of alleles across loci; it means that the “observed” table deviates from the “expected” table. Linkage disequilibrium arises because some “chunks” (or segments) of chromosomes are overrepresented in the population and have not broken down, basically due to the finite size of the population (drift, selection) and also to mutation. For instance, consider a cross of two inbred lines and successive F1, F2…Fn generations. At the end, the chromosomes become a fine-grained mosaic of grey and black. However, complete mixture is difficult to attain.

Chunks of ancestral chromosomes after cross of pure lines and several generations

Linkage disequilibrium describes the non-random association of two loci. Nothing more; so why is it useful? In practice, two loci in LD are most often (very) close, because LD is broken down by recombination. Therefore, linkage disequilibrium between two loci decays on average with distance, and it serves to map genes. In other words, one locus is a proxy for the other, and this is why association analysis implicitly uses linkage disequilibrium to map genes.

5.1 Within-family and population linkage disequilibrium

If we study the distribution of alleles within a family (say, parents and offspring) we will verify that the linkage disequilibrium is very strong. This is because the chromosomes of the parents are almost completely conserved: there are very few recombinations in one generation. Consider for instance the following two sires, and a recombination fraction of 0.25 between the two loci:

Two sires and eight gametes of the progeny, where each family shows linkage disequilibrium but there is no population linkage disequilibrium

Individually considered, the two families have strong within-family linkage disequilibrium. In family 1, pairs “AB” and “ab” come together, but in family 2 pairs “Ab” and “aB” come together. Still, the population of 16 offspring seen as a whole does not have linkage disequilibrium.

However, populations are large families. Therefore, there will be linkage disequilibrium across loci if we look at short enough distances. In general, short-distance linkage disequilibrium reflects old relationships and long-distance linkage disequilibrium reflects recent relationships (Tenesa et al. 2007).

5.1.1 Why QTL are easier to trace within family

Now imagine that locus A/a was a QTL with effects of, say, \(\{ + 10, - 10\}\) and locus B/b was a genetic marker. It is very easy to trace the QTL within each family, but the two pieces of information from each family are contradictory when pooled together. Locus B/b would have apparent effects of \(\{ 5, - 5\}\) in family one but \(\{ - 5,5\}\) in family two. This can be explained as follows. The four chromosomes carrying allele B in family one carry three copies of allele A and one copy of allele a. Therefore, the apparent effect of allele B in family one is \(\frac{3 \times 10 + 1 \times \left( - 10 \right)}{4} = 5\). In family two this is exactly the opposite: \(\frac{1 \times 10 + 3 \times \left( - 10 \right)}{4} = - 5\). Across all families, locus B/b would have an effect of

\[\frac{\left( 3 \times 10 + 1 \times \left( - 10 \right) \right) + \left( 1 \times 10 + 3 \times \left( - 10 \right) \right)}{4 + 4} = 0\]

Therefore, allele B is a good predictor both within families 1 and 2, but not across families.
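The arithmetic above can be checked with a short R snippet (the allele configurations of the B-carrier chromosomes are taken from the text):

```r
# apparent effect of allele B = mean QTL effect of the chromosomes carrying B
qtl_effect <- c(A = 10, a = -10)
family1_B_carriers <- c("A", "A", "A", "a")  # QTL alleles linked to B, family 1
family2_B_carriers <- c("a", "a", "a", "A")  # QTL alleles linked to B, family 2
mean(qtl_effect[family1_B_carriers])                        #  5 within family 1
mean(qtl_effect[family2_B_carriers])                        # -5 within family 2
mean(qtl_effect[c(family1_B_carriers, family2_B_carriers)]) #  0 across families
```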

5.2 Quantifying linkage disequilibrium from gametes’ genotypes or from individuals’ genotypes

There are two classical measures. \(D\) measures the deviation of the observed distribution from the expected one:

\[D = freq\left( \text{AB} \right) - freq\left( A \right)freq(B)\]

Hill and Robertson (1968) proposed, for biallelic loci, to assign numerical values based on gene contents (i.e., \(\{ A,a\}\) would be \(\{ 1,0\}\) and \(\{ B,b\}\) would be \(\{ 1,0\}\)) and compute Pearson’s correlation across loci. In the preceding example, the genotypes at the gametes, \(\left\{ \text{AB},\ \text{AB},\ \text{ab},\ \text{aB},\ \text{ab},\ \text{ab},\ \text{Ab},\ \text{AB},\ \text{Ab},\ \text{AB} \right\}\), can be written as two variables, one \(X\) for “A”,

\[X = \left\{ 1,1,0,0,0,0,1,1,1,1 \right\}\]

and one \(Y\) for “B”,

\[Y = \left\{ 1,1,0,1,0,0,0,1,0,1 \right\}\]

We can get the correlation from R:

X=c(1,1,0,0,0,0,1,1,1,1)
Y=c(1,1,0,1,0,0,0,1,0,1)
cor(X,Y)
[1] 0.4082483

and therefore \(r = 0.41\). It can be shown that \(r = \frac{D}{\sqrt{p_{A}q_{A}p_{B}q_{B}}}\) where \(p_{A} = 1 - q_{A} = \text{freq}(A)\). It has the advantages that \(r^{2}\) is related to the variance in locus A explained by locus B, and that it is easier to understand than \(D\). Both \(D\) and \(r\) depend on the reference allele (e.g. it is not the same to use \(A\) or \(a\) as reference) but \(r^{2}\) is invariant to the reference allele.
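The identity \(r = D/\sqrt{p_{A}q_{A}p_{B}q_{B}}\) can be verified with the example gametes:

```r
# r from gene contents vs. r from D, using the 10 example gametes
X  <- c(1,1,0,0,0,0,1,1,1,1)   # gene content for A
Y  <- c(1,1,0,1,0,0,0,1,0,1)   # gene content for B
pA <- mean(X); pB <- mean(Y)   # 0.6 and 0.5
D  <- mean(X*Y) - pA*pB        # freq(AB) - freq(A)freq(B) = 0.1
r  <- D / sqrt(pA*(1-pA)*pB*(1-pB))
r                              # 0.4082483, identical to cor(X, Y)
```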

We just said that we need genotypes of gametes. This implies that we need to know the phases of the genotypes. The phases are usually not known, although they may be deduced using phasing software. We may still use Hill and Robertson (1968) and compute correlations of gene contents. Our example was:

\[\left\{ AB,\ AB,\ ab,\ aB,\ ab,\ ab,\ Ab,\ AB,\ Ab,\ AB \right\}\]

But we actually have 5 individuals, with genotypes (note the semicolon separating individuals):

\[\left\{ AB,\ AB;\ ab,\ aB;\ ab,\ ab;\ Ab,\ AB;\ Ab,\ AB \right\}\]

If we put this in form of gene content it gives the following table:

\[\mathbf{X}\] 2 0 0 2 2
\[\mathbf{Y}\] 2 1 0 1 1

And therefore we get a correlation as

X=c(2,0,0,2,2)
Y=c(2,1,0,1,1)
cor(X,Y)
[1] 0.6454972

In this example, the value \(r = 0.65\) is not quite the previous estimate of \(r = 0.41\), but in practice using genotypes instead of phased gametes results in good estimates (Rogers and Huff 2009).

When the effective population size (Ne) is small, chromosome segments are longer and LD is stronger. If we compare beef and dairy cattle populations, LD is stronger in dairy cattle because of its smaller Ne. LD also depends on recent and previous recombination events, as it is broken down by recombination. In bovines, moderate LD is observed at distances smaller than 0.1 cM, and strong values (\(r^2 = 0.8\)) are observed at very short distances.

6 Quantitative genetics of markers, or markers as quantitative traits

6.1 Gene content as a quantitative trait

This small chapter puts forward an idea that often goes unnoticed and that was highlighted by (Nicolas Gengler et al. 2008; N. Gengler, Mayeres, and Szydlowski 2007). A detailed but terse account is in (C. C. Cockerham 1969). Consider a marker, not necessarily biallelic. An individual carries a certain number of copies of a given allele, either 0, 1 or 2. This number of copies is usually called gene content (sometimes also called individual gene frequency, a confusing term).

For instance, consider the blood groups AB0 (multiallelic) and Rh (biallelic +/-) in the following tables:

Example of gene content for multiallelic blood group
Individual Genotype Gene count for A Gene count for B Gene count for 0
John AB 1 1 0
Peter A0 1 0 1
Paul 00 0 0 2
Example of gene content for biallelic blood group
Individual Genotype at Rh Gene count for + Gene count for -
John ++ 2 0
Peter + - 1 1
Paul - - 0 2

For a biallelic marker, the table is simpler, because the gene content for one reference allele is 2 minus the gene content for the other allele.

For this reason, in what follows we will denote the gene content of individual \(i\) as \(z_{i}\), which takes values {0,1,2}.

The gene content can thus be “counted”, just as we count milk yield, height, or number of piglets born. The funny thing is that gene content can also be treated as a quantitative measure – just like milk yield, height, or number of piglets born – and it can therefore be studied as a quantitative trait (although it is not a continuous trait). Therefore, gene content can be treated by standard quantitative genetics methods. In the following we will deal with gene content of biallelic markers such as SNPs, but many of the results apply to multiallelic markers such as haplotypes or microsatellites.

6.2 Mean, variance and heritability of gene content

If the alleles are \(\{ A,a\}\) in a population and A is the reference allele, the average gene content \(E\left( z \right)\) is the expected number of copies of A, which is twice the allelic frequency: \(E\left( z \right) = 2p\). In Hardy-Weinberg equilibrium, the variance of gene content is calculated as:

\[\text{Var}\left( z \right) = E\left( z^{2} \right) - E\left( z \right)^{2}\]

Table 4. Variance of gene content

Genotype Frequency \(z^{2}\) \(z\)
AA \(p^{2}\) \(4\) \(2\)
Aa \(2pq\) \(1\) \(1\)
aa \(q^{2}\) \(0\) \(0\)
Average \(4p^{2} + 2pq\) \(2p\)

The expectation \(E\left( z^2 \right) = 4p^{2} + 2pq\) is computed by weighting the column \(z^{2}\) by the column Frequency. Therefore \(\sigma_{z}^{2} = \text{Var}\left( z \right) = 4p^{2} + 2{pq} - \left( 2p \right)^{2} = 2{pq}\).
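These moments can be checked by a quick simulation (the allele frequency is chosen arbitrarily):

```r
# simulate gene contents under Hardy-Weinberg equilibrium
set.seed(1)
p <- 0.3                    # arbitrary allele frequency, for illustration
z <- rbinom(100000, 2, p)   # gene contents of 100,000 unrelated individuals
mean(z)   # close to 2p  = 0.6
var(z)    # close to 2pq = 0.42
```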

The heritability of gene content is the ratio of genetic to phenotypic variance. Clearly, all variance is genetic because the gene content is fully determined by transmission from parents to offspring, and all the genetic variance is additive because gene content is additive by construction (if you think about it, the substitution effect is exactly \(\alpha = 1\)). Also, there is no residual error because the gene content is measured (in principle) perfectly. Therefore, the heritability is 1.

6.3 Covariance of gene content across two individuals.

Let’s write the gene contents of two individuals \(i\) and \(j\) as \(z_{i},z_{j}\). The covariance is \(Cov\left( z_{i},z_{j} \right)\). Individuals \(i\) and \(j\) each carry two copies at the marker. If we draw one copy from \(i\) and another from \(j\), the probability of them being identical by descent is \({\Theta_{\text{ij}} = A}_{\text{ij}}/2\), where \(\Theta\) is known as Malecot’s “coefficient de parenté”, kinship, or coancestry, and \(A_{\text{ij}}\) is the additive relationship. This is just standard theory – two alleles from two individuals are identical if they are IBD. Therefore

\[Cov\left( z_{i},z_{j} \right) = E\left( z_{i}z_{j} \right) - E\left( z_{i} \right)E\left( z_{j} \right)\]

\(E\left( z_{i} \right) = E\left( z_{j} \right) = 2p\). \(E\left( z_{i}z_{j} \right)\) can be obtained as follows. There are four ways to sample one allele from each individual, and the product of the sampled alleles (coded 1 for A, 0 for a) is 1 in two cases. The first one is that the first individual gave allele A (with probability \(p\)) and the second one gave A as well because it was identical by descent (with probability \(A_{\text{ij}}/2\)), which yields a probability of \(pA_{\text{ij}}/2\). The second case is that the first individual gave allele A (with probability \(p\)), the second individual was not identical by descent (with probability \(1 - A_{\text{ij}}/2\)), but at the same time by chance the second individual carried the “A” allele (with probability \(p\)), which yields a probability of \(p(1 - A_{\text{ij}}/2)p\). Summing both probabilities we have \(pA_{\text{ij}}/2 + p(1 - A_{\text{ij}}/2)p = pqA_{\text{ij}}/2 + p^{2}\), and multiplying by the four possible ways gives \(E\left( z_{i}z_{j} \right) = A_{\text{ij}}2pq + 4p^{2}\). Putting all together gives

\[\text{Cov}\left( z_{i},z_{j} \right) = A_{\text{ij}}2pq\]

which means that the covariance between relatives in gene content is a function of their relationship \(A_{\text{ij}}\) and the genetic variance of gene content \(2\text{pq}\). In other words, two related individuals will show similar genotypes at the markers. This result was utilized by (Nicolas Gengler et al. 2008; N. Gengler, Mayeres, and Szydlowski 2007; D. Habier, Fernando, and Dekkers 2007).
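This result can be checked by simulation for sire-offspring pairs, where \(A_{ij} = 0.5\) and the expected covariance is therefore \(0.5 \times 2pq = pq\) (a sketch with an arbitrary allele frequency):

```r
# covariance of gene content between sires and their offspring (A_ij = 0.5)
set.seed(1)
n  <- 200000
p  <- 0.3                      # arbitrary allele frequency
zs <- rbinom(n, 2, p)          # sire gene contents under HWE
# offspring: one allele sampled at random from the sire, one from the population
zo <- rbinom(n, 1, zs/2) + rbinom(n, 1, p)
cov(zs, zo)                    # close to 0.5 * 2pq = 0.21
```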

Extending the result above implies that the gene content in a population can be described like any other trait:

\[E(\mathbf{z}) = \mathbf{2}p\]

\[Var(\mathbf{z}) = \mathbf{A}2pq\]

where \(\mathbf{A}\) is the classical numerator relationship matrix.

6.4 Quality control using heritability of gene content

This was explored by (Forneris et al. 2015). If gene content is a quantitative trait, we can estimate its heritability. We just need a pedigree file and a data file, where the data are now gene contents. The method simply consists in modelling the genotype \(\mathbf{z}\) as a quantitative trait:

\[\mathbf{z = 1}\mu + \mathbf{Wu + e}\]

where \(\mathbf{W}\) is an incidence matrix with 1’s for genotyped individuals and 0 otherwise. This is what the data file looks like:

1 1 533
0 1 1732
2 1 1207
1 1 952
0 1 678
1 1 2299
0 1 2581
1 1 2845
1 1 3123

(gene content, overall mean, animal id). Then we can use REML to estimate the heritability, which should be 1. The result of REML looks like:

Final Estimates
Genetic variance(s) for effect 2
0.34311
Residual variance(s)
0.56669E-04
...
h2 - Function: g_2_2_1_1/(g_2_2_1_1+r_1_1)
Mean: 0.99983

REML cannot estimate a heritability of exactly 1, but it should yield almost 1. If not (for instance if \({\widehat{h}}^{2} < 0.98\)), we have a problem either in the genotypes or in the pedigree.

6.5 Gengler’s method to estimate missing genotypes and allelic frequencies at the base population

A common case is a long pedigree where some, typically young, animals have been genotyped for a major gene (for instance, DGAT1) of interest. It would be useful to have the genotype at the major gene for all individuals (Kennedy, Quinton, and Van Arendonk 1992). Using expressions above, (Nicolas Gengler et al. 2008; N. Gengler, Mayeres, and Szydlowski 2007) suggested a way to estimate gene content for all individuals in a pedigree, as well as allele frequencies. The method simply consists in modeling the genotype \(\mathbf{z}\) as a quantitative trait (just like in the previous section):

\[\mathbf{z = 1}\mu + \mathbf{Wu + e}\]

where \(\mathbf{W}\) is an incidence matrix with 1’s for genotyped individuals and 0 otherwise. A heritability of \(0.99\) is used to estimate gene content through the mixed model equations; on exit, \(\widehat{\mathbf{u}}\) contains estimates of gene content for all individuals (equal to the observed genotypes for the genotyped individuals) and \(\widehat{\mu}\) actually contains \(2\widehat{p}\).

The method has some drawbacks: mainly, the estimate of gene content is a regressed estimate, and therefore individuals tend to look more alike at the major gene than they actually are. For instance, isolated individuals will have an estimate equal to \(2\widehat{p}\). However, Gengler’s method is very important for two reasons: first, it provides an analytical tool to deal with gene content at missing genotypes (and it was completed by (O. F. Christensen and Lund 2010)); second, it serves to estimate allelic frequencies at the base population when it is not genotyped (P. M. VanRaden 2008). It also forms the basis of the gene content multiple-trait BLUP that is briefly described next.
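As an illustration (a toy sketch, not the original implementation), Gengler’s method can be run on a 3-animal pedigree (two unrelated parents and one offspring, the dam ungenotyped) by building and solving the mixed model equations directly with a heritability of 0.99:

```r
# Gengler's method on a toy pedigree: gene content treated as a trait
# animals: 1 = sire (genotyped, z=2), 2 = dam (ungenotyped), 3 = offspring (z=1)
Ainv <- matrix(c( 1.5, 0.5, -1,
                  0.5, 1.5, -1,
                 -1.0,-1.0,  2), 3, 3)    # inverse of A for this pedigree
z  <- c(2, 1)                             # observed gene contents
W  <- rbind(c(1,0,0), c(0,0,1))           # incidence of the genotyped animals
X  <- matrix(1, 2, 1)                     # overall mean
vu <- 0.99; ve <- 0.01                    # heritability of 0.99
LHS <- rbind(cbind(t(X)%*%X/ve, t(X)%*%W/ve),
             cbind(t(W)%*%X/ve, t(W)%*%W/ve + Ainv/vu))
RHS <- rbind(t(X)%*%z/ve, t(W)%*%z/ve)
sol <- solve(LHS, RHS)
mu  <- sol[1]; u <- sol[2:4]
mu + u   # estimated gene contents: close to 2 and 1 for the genotyped animals,
         # a regressed estimate for the ungenotyped dam; mu estimates 2p
```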

6.6 Gene Content Multiple-Trait BLUP

GCMTBLUP can be seen as “Single Step GBLUP for one major gene”. The reader is referred to (Andrés Legarra and Vitezica 2015) for details and also for simulated examples. Assume that we have a “normal” trait (say growth) and also one major gene. The method of “heritability of gene content” can be expanded to include both the “normal” trait \(y\) and the gene content \(z\). For instance:

\[\mathbf{y =}\mathbf{X}_{y}\mathbf{b}_{y}\mathbf{+}\mathbf{W}_{y}\mathbf{u}_{y} + \mathbf{e}_{y}\]

\[\mathbf{z =}\mathbf{X}_{z}\mathbf{b}_{z} + \mathbf{W}_{z}\mathbf{u}_{z} + \mathbf{e}_{z}\]

With genetic covariance across traits as described above, \(\mathbf{G}_{0} = \begin{pmatrix} \sigma_{u_{y}}^{2} & \sigma_{u_{z,y}} \\ \sigma_{u_{z,y}} & \sigma_{u_{z}}^{2} \end{pmatrix}\). The covariance \(\sigma_{u_{z,y}} = 2pqa\) is a function of the effect \(a\) of the major gene in \(z\) on the normal trait \(y\). This method is more accurate than Gengler’s method because it estimates gene content for animals that have not been genotyped based not only on the gene content of relatives but also on “normal” trait information. It actually comes in two flavors: Gene Content Multiple-Trait BLUP (GCMTBLUP) uses the estimated \(\sigma_{u_{z,y}}\) to compute EBVs that include the major gene in an optimal way; if the gene effect is not known with certainty, \(\mathbf{G}_{0}\), and therefore the effect of the major gene, can be estimated by REML (GCMTREML).

7 Imputation in a nutshell

7.1 Classical imputation

Imputation has become part of the regular toolkit of genomic prediction. In essence, the problem is the following: not all animals have the same kind of genomic information. Omitting the case of sequenced animals, the typical cases are animals genotyped with chips of different densities (e.g., 6K, 50K or 700K) and animals without any genotype.

In addition, there is also the problem that, for many animals, some (very few) markers are not genotyped. So, if there are 50,000 markers in one chip, for a typical animal only 49,800 markers are genotyped. Other, more complex cases are Genotyping By Sequencing (GBS) and sequencing, but we will not detail them here.

The theory for imputation in animal breeding is well summarized in (Paul M. VanRaden et al. 2011; Hickey et al. 2011). Output of the programs is usually exact genotypes (the genotype is assumed exactly known), fractional genotypes (probabilities of each genotype) or missing (the genotype of this particular marker and individual is too inaccurate to be imputed). The algorithm for imputation typically proceeds by combining two sources of information:

  1. If in one individual, a chunk (short enough to assume that there is no recombination) of a chromosome (the paternal or the maternal) can be unambiguously identified as coming from one of the four chromosomes of its parents, then the whole chunk has been transmitted. This is fast and efficient if there are individuals with genotypes and pedigree.

  2. If in one individual, at one chunk of a chromosome, a set of markers form a particular pattern, that resembles closely patterns that are already known and that are present in the population, then the “holes” are filled in according to the “known” pattern. This is linkage-disequilibrium based imputation.

A reminder. What imputation can do:

  1. fill in holes from “lower” to “higher” densities (6K to 50K, 50K to 700K, 700K to sequence)

  2. fill in missing markers in the genotypes. For instance, for an animal with a call rate of 99% for a 50K SNP chip, imputation can complete the 500 missing genotypes. This is useful.

Some animals can be imputed without own genotypes using information from genotyped offspring (around 5 genotyped offspring give a decent imputation). In all other cases, it is very hard to impute animals that have not been genotyped for any marker.

7.1.1 Quick and dirty imputations

These forms of imputation are not recommended, but they might be useful for quick studies or prototyping, or if the number of missing genotypes is really small (say, individual and marker call rates \(\approx 0.99\)). One form is just assigning the most frequent genotype (“AA”, or “Aa”, or whatever). Another form is simply assigning genotypes at random, drawing from a distribution with probabilities \(\left( p^{2},2pq,q^{2} \right)\); in R this would be

# p (= freq of the reference allele) must be defined beforehand; q = 1-p
z=sample(c(0,1,2),1,prob=c(p^2,2*p*q,q^2))

Again, this is not recommended. For instance, it can easily give parent-offspring incompatibilities.

7.1.2 Linear imputation

Gengler’s (2007) method cited above “imputes” “genotypes” using a linear method (BLUP) for a linear trait (gene content). It ignores all the neighboring markers and also the Mendelian nature of inheritance of markers (i.e., the offspring of a couple “AA” x “aa” is forcedly “Aa”). But the interesting point of Gengler’s method is that it can be described analytically, which eventually led to the development of Single Step GBLUP (O. F. Christensen and Lund 2010). In particular, the method is useful because it gives a framework for the error in the imputation. We will see this later.

8 Bayesian inference

Bayesian inference is a form of statistical inference based on Bayes’ theorem. This is a statement on conditional probability. We know that

\[p\left( A,B \right) = p\left( \text{A\ } \middle| B \right)p\left( B \right) = p\left( B \middle| A \right)p\left( A \right)\]

Bayes’ theorem says that

\[p\left( B \middle| A \right) = \frac{p\left( A \middle| B \right)p\left( B \right)}{p\left( A \right)}\]

The algebra is valid whether \(A\) and \(B\) are single variables or whether they represent collections of things (e.g., \(A\) can be thousands of phenotypes and \(B\) marker effects and variance components).

Its use in statistical inference is as follows. We want to infer values of \(B\) (effects, for instance) knowing \(A\) (observed phenotypes). For every value of \(B\) we do the following:

  1. We compute \(p\left( A \middle| B \right)\), which is the probability, or likelihood, of \(A\) had we known \(B\).

  2. We multiply this probability by the “prior” probability of \(B\), \(p(B)\).

  3. We cumulate \(p\left( A \middle| B \right)p\left( B \right)\) to form \(p(A)\), which is called the marginal density of \(A\).

8.1 Example of Bayesian inference

Assume that we have a collection of quantitative phenotypes \(\mathbf{y}\mathbf{=}\{ 1,0, - 0.8\}\) with \(k=3\) records and a very simple model \(\mathbf{y} = \mathbf{1}\mu + \mathbf{e}\) with \(Var\left( \mathbf{e} \right) = \mathbf{R} = \mathbf{I}\sigma_{e}^{2}\) and \(\sigma_{e}^{2} = 1\). We will infer \(\mu\) based on Bayes’ theorem; actually, we will infer a whole distribution for \(\mu\), what is called the posterior distribution, based on

\[p\left( \mu \middle| \mathbf{y} \right) = \frac{p\left( \mathbf{y} \middle| \mu \right)p\left( \mu \right)}{p\left( \mathbf{y} \right)}\]

where

\[p\left( \mathbf{y} \middle| \mu \right) = MVN\left( \mathbf{1}\mu,\mathbf{R} \right) = \frac{1}{\sqrt{\left( 2\pi \right)^{k}\left| \mathbf{R} \right|}} \exp\left( - \frac{1}{2}\left( \mathbf{y - 1}\mu \right)^{'}\mathbf{R}^{- 1}\left( \mathbf{y - 1}\mu \right) \right)\]

is the “likelihood” of the data for a given value of \(\mu\).

However, it is unclear what \(p(\mu)\) means. It is usually interpreted as a prior distribution for \(\mu\), which means that we must give probability values to each possible value of \(\mu\). These probabilities may come from previous information or just from mathematical or computational convenience, but they must not come from the data \(\mathbf{y}\). Prior distributions require a mental exercise of thinking whether \(\mu\) has been “drawn” from some distribution (e.g., it is a particular farm among a collection of farms), or whether there are biological laws that impose prior information – for instance, the infinitesimal model suggests a normal distribution for genetic values. If this is the case, such an effect is often called “random” in the jargon.

Finally, \(p(\mathbf{y})\) is the probability of the data if we average \(p\left( \mathbf{y} \middle| \mu \right)\) across all possible values of \(\mu\), weighted by its probability \(p\left( \mu \right)\).

Consider that there are only two possible values of \(\mu\), -1 and 1 with equal a priori probabilities of 0.5 and 0.5. Then we can create this table:

example of Bayesian inference with two a priori values for \(\mu\)
\[p(\mu)\] \[p(\mathbf{y} | \mu)\] \[p( \mathbf{y} | \mu)p( \mu)\] \[p(\mu|\mathbf{y})=\frac{p(\mathbf{y}|\mu)p(\mu)}{p(\mathbf{y})}\]
\[\mu = - 1\] 0.5 0.0051 0.00255 0.40
\[\mu = 1\] 0.5 0.0076 0.00381 0.60
\[p(\mathbf{y})\] 0.00636

So, the final result is that the mean \(\mu\) has a value of either -1 (with posterior probability 0.40) or 1 (with posterior probability 0.60). The posterior expectation of the mean is \(E\left( \mu|\mathbf{y} \right) = 1 \times 0.60 + \left( - 1 \right) \times 0.40 = 0.20\).
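The table can be reproduced in R:

```r
# Bayesian inference with two candidate values for mu: -1 and 1
y     <- c(1, 0, -0.8)
prior <- c(0.5, 0.5)                      # p(mu) for mu = -1 and mu = 1
lik   <- c(prod(dnorm(y, mean = -1)),     # p(y | mu = -1), sigma2_e = 1
           prod(dnorm(y, mean =  1)))     # p(y | mu =  1)
py    <- sum(lik * prior)                 # p(y), about 0.00636
post  <- lik * prior / py                 # posterior: about 0.40 and 0.60
sum(post * c(-1, 1))                      # posterior mean, about 0.20
```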

If the prior distribution for the mean is continuous, for instance \(N(0,\sigma_{\mu}^{2})\) (say \(\sigma_{\mu}^{2} = 10\)), then the posterior distribution of \(\mu\) is continuous as well. Therefore, it is impossible to enumerate all cases as above. When the prior distribution is normal and the likelihood too, the posterior distribution can be derived analytically (e.g. in (Sorensen and Gianola 2002)) and is

\[p\left( \mu \middle| \mathbf{y} \right) = N\left( \widehat{\mu},lhs^{- 1} \right)\]

where

\[lhs = \frac{\mathbf{1}^{'}\mathbf{1}}{\sigma_{e}^{2}} + \frac{1}{\sigma_{\mu}^{2}}\]

\[\widehat{\mu} = \left( \text{lhs}^{- 1} \right)\frac{\mathbf{1}^{'}\mathbf{y}}{\sigma_{e}^{2}}\]

So, the posterior mean is \(\widehat{\mu} = 0.064\) with a posterior standard deviation of 0.57.
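These values can be verified numerically:

```r
# posterior of mu with a normal prior N(0, 10) and known vare = 1
y     <- c(1, 0, -0.8)
vare  <- 1
varmu <- 10
lhs   <- length(y)/vare + 1/varmu   # 3.1
muhat <- sum(y)/vare / lhs          # posterior mean: about 0.064
sqrt(1/lhs)                         # posterior sd:   about 0.57
```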

8.2 The Gibbs sampler

Things get more complicated when we have several unknowns in our model. For instance, we might not know the residual variance \(\sigma_{e}^{2}\), so we want to evaluate

\[p\left( \mu,\sigma_{e}^{2} \middle| \mathbf{y} \right) = \frac{p\left( \mathbf{y} \middle| \mu,\sigma_{e}^{2} \right)p\left( \mu \right)p\left( \sigma_{e}^{2} \right)}{p\left( \mathbf{y} \right)}\]

Writing down the posterior distributions in closed form is impossible. The Gibbs sampler is a numerical Monte Carlo technique that allows drawing samples from such a distribution. The idea is as follows. If we knew \(\mu\), we could derive the posterior distribution of \(\sigma_{e}^{2}\); if we knew \(\sigma_{e}^{2}\), we could derive the posterior distribution of \(\mu\). These distributions “pretending that we know” are known as conditional distributions, and they need to be known only up to proportionality (this makes the algebra less miserable). In our example they are:

\[p(\sigma_{e}^{2}|\mathbf{y},\mu)\]

\[p(\mu|\mathbf{y},\sigma_{e}^{2})\]

If these distributions are known, we can draw successive samples from them and then plug these samples into the right-hand side of the expressions, “as if” they were true, and iterate the procedure. So we start with, say, \(\mu = 0\) and \(\sigma_{e}^{2} = 1\). Then we draw a new \(\mu\) from

\[p\left( \mu \middle| \mathbf{y},\sigma_{e}^{2} \right) = N\left( \widehat{\mu},lhs^{- 1} \right)\]

Then \(\sigma_{e}^{2}\) from

\[p\left( \sigma_{e}^{2} \middle| \mathbf{y,}\mu \right) = {\left( \mathbf{y - 1}\mu \right)^{'}\left( \mathbf{y - 1}\mu \right)\chi}_{k}^{- 2}\]

which is the conditional distribution assuming flat priors for \(\sigma_{e}^{2}\). Then we plug in this value into \(p\left( \mu \middle| \mathbf{y},\sigma_{e}^{2} \right)\) and we iterate the procedure. After a period, the samples so obtained are from the posterior distribution. Typically, thousands of iterates are needed, if not more. The following R code shows a simple simulated example.

set.seed(1234)
# simulate ndata data points with mean 100 and residual variance 20
ndata=10
y=100+rnorm(ndata)*sqrt(20)
# Gibbs sampler
# initial values
mu=-1000
vare=10000
varmu=1000
# place to store samples
mus=c()
vares=c()
# sampling per se
for (i in 1:50){
  lhs=ndata/vare+1/varmu
  rhs=sum(y)/vare
  mu=rnorm(1,rhs/lhs,sqrt(1/lhs))
  vare=sum((y-mu)^2)/rchisq(1,ndata)
  cat(mu,vare,"\n")
  mus=c(mus,mu)
  vares=c(vares,vare)
}

The “beauty” of this system of inference is that we decompose a complex problem into smaller ones. For instance, variance component estimation proceeds by sampling breeding values (as in a BLUP “with noise”, Robin Thompson dixit), and then variance components are sampled as if these EBVs were true.

8.3 Post Gibbs analysis

A Gibbs sampler does not converge to a final value, unlike REML, in which each iterate is better than the previous one. Instead, at the end we have a collection of samples as follows:

Mu vare
38.47288 6832.21
76.12334 323.1892
85.76835 267.1094
91.08181 120.2974
100.1114 19.85989
98.52846 19.85005
98.03879 14.52127
97.54579 20.33205
98.10108 14.76999
99.39184 6.538137
96.90541 13.92563
...

and these samples define the posterior distribution of our estimator.

The first point is to verify that the chain has converged to the desired posterior distribution. Informal testing plots are very useful. For instance, plot(vares) in the above example shows that initial values of \(\sigma_{e}^{2}\) were outside the desired posterior distribution. We can discard some initial values and keep the rest.

We need to report a final estimate, e.g. of \(\sigma_{e}^{2}\), from this collection of samples. Contrary to REML, the last sample of \(\sigma_{e}^{2}\) is not the most exact one; it is the whole collection of samples that is of interest, because they approximate the posterior distribution of the estimator. A typical choice is the posterior mean, which is the average of the samples. In the example above, you can for instance discard the first 20 iterations as burn-in and then use the posterior mean across the last 30 samples of the residual variance:

mean(vares[21:50])
[1] 20.28395

which is very close to the simulated value of 20. The post-Gibbs analysis is clumsy but important, and packages such as BOA exist in R to simplify things.

9 Models for genomic prediction

If SNP are, most of the time, just markers located outside genic regions, why use them? Because they may be linked to QTL or genes, a fact explained by an event called linkage disequilibrium (LD). LD compares expected versus observed haplotype frequencies and measures the non-random association of alleles across loci. We have seen LD before. The strength of the association between two loci is measured by the correlation. We assume that, if neighboring SNPs are tightly correlated, then QTLs that are “in the middle” should be strongly correlated with them as well (this might not be true – for instance if all QTLs have very low frequency, but that seems unlikely).

Instead of talking about association between loci, let us assume we can use SNP to deduce the genotype of animals at each unobserved QTL. With dense SNP panels (e.g., 50,000 SNP), it is more likely that a QTL will be in LD with at least one SNP. If QTL A is linked to SNP B then, depending on the strength of this linkage, observing SNP B tells us that QTL A was inherited together with it. In this way, genomic selection relies on the LD between SNPs and QTL: although we do not observe the QTL, an indirect association between SNP and trait phenotype can be observed:

Indirect association QTL - markers - phenotype

The effectiveness of genomic selection can be predicted from the proportion of the trait variance that the SNPs can explain.

There are mainly two classes of methods for genomic selection:

  1. SNP effect-based method

  2. Genomic relationship-based method

For most livestock populations, the number of SNP is greater than the number of genotyped animals, which results in the famous “small n big p” problem. As the number of parameters is greater than the number of data points used for estimation, a solution is to assume SNP effects are random; in this way, all effects can be jointly estimated. To present the SNP effect-based method we will start with a single gene and then move towards more of them.

9.1 Simple marker model

9.1.1 Multiallelic

Assume there is a marker in complete, or even incomplete, LD with a QTL. For example, the polymorphism in the halothane gene (HAL) is a predictor of bad meat quality in swine. The simplest way to fit this into a genetic evaluation is to estimate the effect of the marker by a linear model and least squares:

\[\mathbf{y = Xb + marker + e}\]

Where in “marker” we actually introduce a marker with alleles and their effects. More formally, allele effects are embedded in vector \(\mathbf{a}\) and their incidence matrix is in matrix \(\mathbf{Z}\):

\[\mathbf{y = Xb + Za + e}\]

For instance, assume that we have a four-allele \(\left\{ A,B,C,D \right\}\) locus and three individuals with genotypes \(\{\text{BC},\text{AA},\text{BD}\}\). Then

\[\mathbf{Za =}\begin{pmatrix} 0 & 1 & 1 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ a_{C} \\ a_{D} \\ \end{pmatrix}\]

Note that we have put a 2 for the genotype “AA”. This means that the effect of a double copy of “A” is twice that of a single copy. This is an additive model.

And for \(\mathbf{y} = \left\{ 12,35,6 \right\}\) this gives

\[\mathbf{y = Xb + Za + e}\]

\[\begin{pmatrix} 12 \\ 35 \\ 6 \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} 1 \\ 1 \\ 1 \\ \end{pmatrix}\mu\mathbf{+}\begin{pmatrix} 0 & 1 & 1 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ a_{C} \\ a_{D} \\ \end{pmatrix}\mathbf{+}\begin{pmatrix} e_{1} \\ e_{2} \\ e_{3} \\ \end{pmatrix}\]
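This small system can be checked numerically. The following sketch (in Python, though any language would do) builds the incidence matrix for the toy genotypes and phenotypes above; with 3 records and 5 unknowns the system is underdetermined, so `numpy.linalg.lstsq` returns the minimum-norm least-squares solution:

```python
import numpy as np

# Toy data from the text: three individuals with genotypes {BC, AA, BD}
# at a four-allele locus {A, B, C, D}; counts of each allele per individual.
Z = np.array([[0., 1., 1., 0.],   # BC
              [2., 0., 0., 0.],   # AA
              [0., 1., 0., 1.]])  # BD
y = np.array([12., 35., 6.])

# Design matrix: intercept (mu) plus one additive effect per allele.
X = np.hstack([np.ones((3, 1)), Z])

# Underdetermined (3 records, 5 unknowns): lstsq gives the minimum-norm
# least-squares solution, which here fits the 3 records exactly.
sol, *_ = np.linalg.lstsq(X, y, rcond=None)
print(sol)        # [mu, a_A, a_B, a_C, a_D]
print(X @ sol)    # reproduces y
```

Note that with more effects than records many solutions fit equally well; this is precisely the “small n big p” issue discussed later.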

9.1.2 Biallelic

Assume now that we do the same with a simple, biallelic marker (say \(\{ A,B\}\)). Consider three individuals with genotypes \(\{\text{BB},\text{AA},\text{BA}\}\):

\[\mathbf{Za =}\begin{pmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ \end{pmatrix}\]

and

\[\begin{pmatrix} 12 \\ 35 \\ 6 \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} 1 \\ 1 \\ 1 \\ \end{pmatrix}\mu\mathbf{+}\begin{pmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ \end{pmatrix}\mathbf{+}\begin{pmatrix} e_{1} \\ e_{2} \\ e_{3} \\ \end{pmatrix}\]

However, because there is redundancy (if the allele is not A, then it is B) it is mathematically equivalent to prepare a regression of the trait on the number of copies of a single allele, say A. Then \(\mathbf{Z}\) becomes a vector \(\mathbf{z}\) and the vector \(\begin{pmatrix} a_{A} \\ a_{B} \\ \end{pmatrix}\) becomes a scalar \(a_{A}\) . So for individuals \(\{ BB,AA,BA\}\) we have that

\[\mathbf{z}a\mathbf{=}\begin{pmatrix} 0 \\ 2 \\ 1 \\ \end{pmatrix}a_{A}\]

and

\[\begin{pmatrix} 12 \\ 35 \\ 6 \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} 1 \\ 1 \\ 1 \\ \end{pmatrix}\mu\mathbf{+}\begin{pmatrix} 0 \\ 2 \\ 1 \\ \end{pmatrix}a_{A}\mathbf{+}\begin{pmatrix} e_{1} \\ e_{2} \\ e_{3} \\ \end{pmatrix}\]

The effect of the marker can be estimated by least squares or another regression method. The marker should capture a large part of the variance explained by the gene. The model can be enriched by adding an extra polygenic term \(\mathbf{u}\) based on pedigree, with \(Var\left( \mathbf{u} \right)\mathbf{=}\mathbf{A}\sigma_{u}^{2}\), like for instance in

\[\mathbf{y = Xb + z}a\mathbf{+ Wu + e}\]
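A minimal sketch (in Python, omitting the polygenic term) of the least-squares fit of this gene-content regression, using the toy records above:

```python
import numpy as np

# Gene-content regression: individuals {BB, AA, BA}, z = copies of allele A.
z = np.array([0., 2., 1.])
y = np.array([12., 35., 6.])

# Least squares for y = mu + z * a_A + e (normal equations).
X = np.column_stack([np.ones(3), z])
mu_hat, a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(mu_hat, a_hat)   # approx 6.17 and 11.5
```

Each extra copy of allele A is estimated to add about 11.5 units to the phenotype; this is the additive (regression) coding at work.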

You may realize that this follows the chapter “Values and means” in (Falconer and Mackay 1996).

9.2 Why markers can’t be well chosen: lack of power and the Beavis effect

The method above can potentially be extended to more markers explaining the trait. However, its failure resides in that we do not know which markers are associated with the trait. This is a very serious problem, because finding out which markers are linked to a trait generally induces lots of errors – both because of the nature of complex traits, and because of the Beavis effect.

The genetic background of complex traits seems to be highly complex and largely infinitesimal: many genes act, possibly with interactions among them, to give the genetic determinism of one trait. Most of them bear small effects; some may have large effects. Current alternatives for localization of genes include the Genome-Wide Association Study (GWAS). This consists in testing markers, one at a time, for their effect on a trait, mostly with a simple linear model as above. The procedure selects those markers with a significant effect after a statistical test, for instance a t-test. This test is usually corrected by Bonferroni to avoid spurious results. However, this way of proceeding leads to lack of power and bias. This will be shown next.

9.2.1 Lack of power

This is because a small effect can rarely be detected. The general formulae for power can be found in, e.g., (Luo 1998) and are implemented in R package ldDesign. A very simple version of the formulae for power where the causal variant is truly tagged by a marker is (I owe this expression to Anne Ricard)

\[power = 1 - \Phi\left( Z_{1 - \frac{\alpha}{2}}\ - \beta\sqrt{2pq\left( n - 2 \right)} \right)\]

with \(Z_{1 - \alpha/2}\) the rejection threshold, that is \(\approx 4.81\) after Bonferroni correction for 50,000 markers. For instance, in a population of n=1000 individuals, a QTL explaining 1% of the variance and perfectly tagged by a marker will be found 4% of the time. If 100 such QTLs exist in the population, only 4 of them will be found. The following Figure shows the power of detection of a QTL perfectly tagged explaining from 0 to 100% of the phenotypic variance.
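This computation can be sketched in Python with only the standard library (the rejection threshold is recomputed here by bisection, assuming a two-sided test and Bonferroni correction over 50,000 markers; the exact threshold depends on the one- vs two-sided convention used):

```python
import math

def phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_quantile(prob, lo=-10.0, hi=10.0):
    # inverse normal CDF by bisection (accurate enough for this sketch)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

n, p = 1000, 0.5                   # individuals, allele frequency
q = 1.0 - p
beta = math.sqrt(0.01 / (2*p*q))   # effect so the QTL explains 1% of variance
alpha = 0.05 / 50000               # Bonferroni over 50,000 markers
z_thr = z_quantile(1.0 - alpha/2.0)
power = 1.0 - phi(z_thr - beta * math.sqrt(2*p*q*(n - 2)))
print(z_thr, power)                # power around 4%, as quoted in the text
```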

Power of detection of QTL effects perfectly tagged explaining from zero to 100% phenotypic variance

9.2.2 The Beavis (or winner’s curse) effect

This comes as follows. We are mapping QTLs. To declare a QTL in a position, we perform a test (for example a t-test). This test depends on the estimated effect of the QTL, but

\[\text{estimated effect}\ = \text{real effect} + \text{"estimation noise"}\]

By keeping selected QTLs, we often keep large and positive noises. This would be negligible if there were few QTLs, all with large effects, but this is not the case. Large noises will occur in analyses with many markers, and this biases the estimated QTL effects, making them look much larger than they are, in particular if they are small. The problem is exacerbated in GWAS approaches, because many markers are tested.

For instance, assume that a marker with allelic frequency \(p = 0.5\) truly explains 5% of the variance. Using formulae in Xu (2003), the variance explained by this marker will be overestimated and show up as 5.1% at the regular type-I error rate (\(\alpha = 0.05\)). This does not change for stricter Bonferroni-like tests, e.g., \(\alpha = 0.05/50000\). However, for markers explaining 0.5% of the variance, the apparent variance explained is 0.9% (twice in excess) at \(\alpha = 0.05\) and a formidable 2.7% at \(\alpha = 0.05/50000\) (roughly a 5-fold overestimation of the explained variance). Therefore, collecting 40 such significant markers may look like capturing all genetic variation whereas in fact they only capture 20% of the variance. The following R script allows these computations.

#Beavis effect by Xu, 2003, Genetics 165: 2259–2268
bias.beavis <- function(sigma2=1,n,p=.5,alpha,a){
  # this function computes real and apparent
  # (from QTL detection estimates) variance
  # explained by a biallelic QTL with effect a and
  # allelic frequency p at alpha risk
  # Andres Legarra, 7 March 2014
  gamma=2*p*(1-p)
  sigma2x=gamma
  eps1=-qnorm(1-alpha/2)-sqrt(n*gamma/sigma2)*a
  eps2= qnorm(1-alpha/2)-sqrt(n*gamma/sigma2)*a
  psi1=dnorm(eps1)/(1+pnorm(eps1)-pnorm(eps2))
  psi2=dnorm(eps2)/(1+pnorm(eps1)-pnorm(eps2))
  B=gamma*(sigma2/(n*sigma2x))*(1+eps2*psi1-eps1*psi2)
  var.explained=gamma*a**2
  var.attributed=var.explained+B
  att.over.exp=var.attributed/var.explained
  rel.var.explained=var.explained/sigma2
  rel.var.attributed=var.attributed/sigma2
  list(
    var.explained=var.explained,
    var.attributed=var.attributed,
    rel.var.explained=rel.var.explained,
    rel.var.attributed=rel.var.attributed,
    att.over.exp=att.over.exp
  )
}

The following graph shows the true variance explained by the QTL and the variance apparently explained by the QTL, for QTL effects ranging from 0 to 0.5 standard deviations, i.e., explaining up to 12% of the variance. It can be seen that small effects are systematically exaggerated.

True (straight line) and apparent (dotted line) variance explained by QTL effects going from zero to 0.5 genetic standard deviations

The two following graphs, from very crude simulations, show both problems. The first one shows no bias; the second shows, first, that only 3 out of 100 QTL were found (lack of power) and, second, that those 3 are largely overestimated (Beavis effect).

Real (O) and estimated (*) effects after GWAS-like simulations with 10 true QTLs in 5000 markers, 1000 individuals.
Real (O) and estimated (*) effects after GWAS-like simulations with 100 true QTLs in 50000 markers, 1000 individuals

9.3 Fit all markers

Lande and Thompson (Lande and Thompson 1990) suggested getting the list of associated markers and their effects from an independent population. Whereas this is now typically done in human genetics, it seems impossible in agricultural populations. First, the associations are random, so markers associated in one population are not necessarily associated in another one. Second, even the true list of acting genes and QTL will vary across populations due to drift or selection. One example is the bovine myostatin gene (GDF8): both the Belgian Blue and South Devon breeds carry the same GDF8 mutation, but they have different conformation and double-muscling phenotypes (Smith et al. 2000; Dunner et al. 2003).

These problems plague GWAS and QTL detection analyses. Further, nothing guarantees that markers with no effect at one stage will have no effect at another one, for instance because of interactions. A simple way to avoid both the lack of power and the Beavis effect is not to use detection thresholds at all. Therefore, all markers are assumed to be QTL. This simple idea gave (T. H. E. Meuwissen, Hayes, and Goddard 2001) the key to attack the estimation of the whole genetic value based on markers. First, markers with small effects will be included. Second, no bias will be induced by the detection process.

Therefore, one should include all markers in genomic prediction. In a way, this makes sense because we use all information without discarding anything. But how is this doable? The simplest way is to fit a linear model with the effects of all markers. Note that for this approach to work, the markers need to cover all the genome; many markers are needed.

Individual \(i\) has a breeding value \(u_{i}\). According to the previous paragraphs, we will try to predict the breeding value of an individual, defined as a sum of marker effects \(a_{k}\) (there are \(m\) of them). An individual has genotypes coded in \(\mathbf{z}_{i}\), and its breeding value is the sum of marker effects \(a_{k}\) weighted by the coefficients in \(\mathbf{z}_{i}\): \(u_{i} = \sum_{k = 1}^{m} z_{ik}a_{k} = \mathbf{z}_{i}\mathbf{a}\). For all individuals this becomes \(\mathbf{u = Za}\).

9.3.1 Multiple marker regression as fixed effects

9.3.1.1 Multiallelic

The multiple marker regression is a simple extension of the single marker regression shown above. First, we construct a model where the phenotype is a function of all marker effects:

\[\mathbf{y}\mathbf{=}\mathbf{\text{Xb}}\mathbf{+}\mathbf{Za}\mathbf{+}\mathbf{e}\]

For instance, assume that we have a four-allele \(\left\{ A,B,C,D \right\}\) locus, another locus with alleles \(\{ E,F\}\) and three individuals with genotypes \(\{\text{BC}/\text{EE},\text{AA}/\text{EF},\text{BD}/\text{FF}\}\). Then

\[\mathbf{Za =}\begin{pmatrix} 0 & 1 & 1 & 0 & \vdots & 2 & 0 \\ 2 & 0 & 0 & 0 & \vdots & 1 & 1 \\ 0 & 1 & 0 & 1 & \vdots & 0 & 2 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ a_{C} \\ a_{D} \\ \cdots \\ a_{E} \\ a_{F} \\ \end{pmatrix}\]

9.3.1.2 Biallelic

With biallelic markers, we can reduce the number of unknowns to just one effect per marker – the effect of the reference allele. Assume now that we have just three individuals and two biallelic markers: a two-allele \(\left\{ A,B \right\}\) locus and a two-allele \(\{ E,F\}\) locus, with genotypes \(\{\text{BA}/\text{EE},\text{AA}/\text{EF},\text{BB}/\text{FF}\}\). If we fit one effect per allele, the system of equations is:

\[\mathbf{Za =}\begin{pmatrix} 1 & 1 & \vdots & 2 & 0 \\ 2 & 0 & \vdots & 1 & 1 \\ 0 & 2 & \vdots & 0 & 2 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ a_{B} \\ \cdots \\ a_{E} \\ a_{F} \\ \end{pmatrix}\]

And if we reduce the effects to one effect per marker, we get

\[\mathbf{Za =}\begin{pmatrix} 1 & \vdots & 2 \\ 2 & \vdots & 1 \\ 0 & \vdots & 0 \\ \end{pmatrix}\begin{pmatrix} a_{A} \\ \cdots \\ a_{E} \\ \end{pmatrix}\]

Again, estimation of \(\mathbf{a}\) can proceed by least squares.

9.3.1.3 Massive number of markers

Imagine that we have 20 markers and 3 individuals, matrix \(\mathbf{Z}\) looks like:

1 1 2 2 1 0 0 1 0 0 2 0 0 2 0 2 2 0 1 1

0 1 2 1 2 1 0 1 2 2 2 2 0 2 1 0 1 0 0 1

2 0 2 0 0 2 1 0 0 0 1 1 0 2 2 1 0 0 0 1

But SNP chips yield tens of thousands of markers. This poses two kinds of problems. The first one is practical: we cannot (reliably) estimate 50,000 effects from, say, 1,000 records in \(\mathbf{y}\). The second is conceptual: does it make sense to estimate all these marker effects without imposing any constraint? In fact, one should not expect a marker to have a large effect; rather, we expect effects restricted to plausible values. For instance, a marker should not have an effect of, say, one phenotypic standard deviation of the trait. In a way, this is “a priori” information, and there must be a way to introduce it. But this brings in a very old subject of genetic evaluation: prediction. After explaining prediction, we will go back to models.
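The practical problem can be illustrated with a toy sketch (sizes and genotypes are invented for illustration): with fewer records than markers, the least-squares system is singular, whereas treating marker effects as random with some variance ratio makes it solvable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 200                      # 10 records, 200 markers: "small n big p"
Z = rng.integers(0, 3, size=(n, m)).astype(float)   # 012-coded genotypes
y = rng.normal(size=n)

# Ordinary least squares cannot work: Z'Z has rank at most n << m.
assert np.linalg.matrix_rank(Z.T @ Z) <= n

# Assuming marker effects are random with a variance ratio lam = vare/vara
# regularizes the system (this is ridge regression, i.e. SNP-BLUP):
lam = 100.0
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
print(a_hat[:5])                    # small, shrunken estimates
```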

9.4 Bayesian Estimation, or Best Prediction, of marker effects

Marker effects can be considered the result of random processes, because they result from the random buildup of linkage disequilibrium, random generation of alleles at genes, and so on. Therefore, they have (or may have) an associated distribution (whether you call this a sampling distribution or a prior distribution is largely a matter of taste). I will generally call this prior information. It is well known (Casella and Berger 1990) that accurate prediction of random effects involves integration of all information, prior information and observed information, which in our case comes in the form of observed phenotypes.

If we call \(\mathbf{a}\) the marker effects, and \(\mathbf{y}\) the data, the Posterior Mean, or Conditional Expectation of (estimators of) marker effects is given by the expression

\[\widehat{\mathbf{a}}\mathbf{=}E\left( \mathbf{a}\left| \mathbf{y} \right.\ \right)\mathbf{=}\frac{\int_{}^{}{\mathbf{\text{a\ }}p\left( \mathbf{y} \mid \mathbf{a} \right)p\left( \mathbf{a} \right)d\mathbf{a}}}{\int_{}^{}{\mathbf{\ }p\left( \mathbf{y} \mid \mathbf{a} \right)p\left( \mathbf{a} \right)d\mathbf{a}}}\]

We have already discussed the Posterior Mean in the introduction to Bayesian inference. This is often called Best Prediction, because in a Frequentist context it minimizes, over conceptual repetitions of the procedure, the distance between the “true” \(\mathbf{a}\) and its estimator, \(\widehat{\mathbf{a}}\) (Casella and Berger 1990). On the other hand, it can be seen as a Bayesian estimator as described above. This estimator has an extraordinary advantage over regular least squares, because it uses all available information (D. Gianola and Fernando 1986). Further, it has been proven that Best Predictors are optimal for selection (Cochran 1951; R. Fernando and Gianola 1986; Goffinet and Elsen 1984). The introduction of the prior distribution \(p\left( \mathbf{a} \right)\) has the effect of “regressing” the estimators towards the a priori values, a process known as shrinkage. Therefore, Best Predictors are “shrunken” or “regressed” estimators.

In the context of genomic predictions, the Best Predictor is composed of two parts:

  1. The prior distribution of marker effects \(p\left( \mathbf{a} \right)\)

  2. The likelihood of the data given the marker effects, \(p\left( \mathbf{y} \mid \mathbf{a} \right)\)

Breeders have a fairly decent idea of how to write the latter, \(p\left( \mathbf{y} \mid \mathbf{a} \right)\). Most often this is written as a normal likelihood, of the form

\[p\left( \mathbf{y} \mid \mathbf{a} \right) = MVN\left( \mathbf{Xb + Za,R} \right)\]

where matrix \(\mathbf{R}\) contains residual covariances. The model may include further linear terms such as pedigree-based covariances, permanent effects, and so on. However, how to write down the prior distribution \(p\left( \mathbf{a} \right)\) is far from being clear, and this has been the subject of frantic research during the last decade. This will be part of the subject of the following sections.

9.4.1 Best Predictions as a regularized estimator

Regularized predictors are now widely used in statistics. They are composed of two parts: a likelihood, and a regularization function which prevents the estimators from going “too far away”. For instance, the regular Lasso (Tibshirani 1996) can be understood as an estimator that uses a likelihood as above, combined with the restriction \(\sum_{i}\left| a_{i} \right| \leq \lambda\). Another example is Ridge Regression, where there is a penalty function of \(a_{i}^{2}\) – the larger the square of the effect, the more it is penalized. The justification of these estimators is largely practical. However, from the point of view of a Bayesian or a Frequentist (or an animal breeder), they are Bayesian (or Best Predictor) estimators with particular sampling or a priori distributions. For instance, the Lasso assumes that (marker) effects are a priori distributed following a Laplace (double exponential) distribution, and Ridge Regression assumes that effects are a priori normally distributed. A major advantage of this understanding is that it allows the connection between classical quantitative genetics theory and prior distributions for marker effects.
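The different shrinkage behaviors can be seen in the textbook closed forms for an orthonormal design, where both penalties act directly on the least-squares estimate: Ridge shrinks all effects proportionally, while the Lasso soft-thresholds them, setting small effects exactly to zero. A minimal sketch (example numbers are arbitrary):

```python
import numpy as np

def ridge_shrink(b, lam):
    # ridge regression: proportional shrinkage of the least-squares estimate
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    # lasso: soft thresholding -- small effects are set exactly to zero
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])   # least-squares estimates
print(ridge_shrink(b, 1.0))   # all halved, none exactly zero
print(lasso_shrink(b, 1.0))   # [-1, 0, 0, 0, 2]: small effects zeroed
```

This difference mirrors the priors: a normal prior (Ridge) never believes an effect is exactly zero, whereas a sharply peaked Laplace-like prior concentrates small effects at zero.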

9.5 The ideal process for genomic prediction

We have prepared the conceptual setup. The process of genomic prediction consists in estimating marker effects using the Conditional Mean of marker effects as above, which is based on phenotypes at the trait(s) of interest and the prior distribution of marker effects. This creates a prediction equation which can be summarized as something like:

A form of prediction equation.

Locus   Allele   Effect estimate
1       A        +10
2       E        +5

For the i-th individual, the product of its genotype (the i-th row, \(\mathbf{z}_{i}\) of matrix \(\mathbf{Z}\)) and the alleles’ effects (in \(\widehat{\mathbf{a}}\)) gives a genomic estimated breeding value, say \({\widehat{u}}_{i} = \mathbf{z}_{i}\widehat{\mathbf{a}}\). This applies equally well to animals with or without phenotype. The next section of these notes will describe how this can be accomplished through the so-called Bayesian regressions.
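Applying the toy prediction equation above to some hypothetical candidates (the gene contents \(\mathbf{z}_{i}\) are invented for illustration):

```python
import numpy as np

# Effect estimates from the toy prediction equation:
# +10 per copy of allele A at locus 1, +5 per copy of allele E at locus 2.
a_hat = np.array([10.0, 5.0])

# Gene contents z_i for three hypothetical candidates (rows).
Z = np.array([[2., 1.],
              [0., 2.],
              [1., 0.]])

gebv = Z @ a_hat    # u_hat_i = z_i * a_hat
print(gebv)         # [25. 10. 10.]
```

The candidates need no phenotype of their own: only genotypes and the previously estimated marker effects.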

Process of genomic prediction

Process of genomic prediction (r: references; c: candidates)

10 SNP effect-based methods: SNP-BLUP and Bayesian regressions

For most livestock populations, the number of SNP is greater than the number of genotyped animals, which results in the famous “small n big p” problem. As the number of parameters is greater than the number of data points used for estimation, a solution is to assume SNP effects are random (or that they have a prior distribution); in this way, all effects can be jointly estimated. Even if the number of genotyped animals is large, it still makes sense to fit SNPs as random, because the prior information says that small effects are frequent and large effects are unlikely.

Bayesian regression is another name for the Best Predictor or Conditional Expectation described above, and it reflects the fact that we compute Conditional Expectations (another name for regressions (Casella and Berger 1990)) using Bayesian methods. The term was first introduced in the genomic prediction literature by (Campos et al. 2009) and has been used since. The Bayesian regression is, as described above, composed of a likelihood \(p\left( \mathbf{y} \mid \mathbf{a} \right) = \text{MVN}\left( \mathbf{Xb}+\mathbf{Za},\mathbf{R} \right)\) and a prior distribution for markers, \(p(\mathbf{a})\). A full and comprehensive account of Bayesian regressions for genomic prediction is in (Campos et al. 2013). However, before presenting the different models for Bayesian regressions, we will detail how allele coding should proceed in these methods.

10.1 Allele coding.

Allele coding is the assignment of genotypes to numerical values in matrix \(\mathbf{Z}\). Strandén and Christensen (2011) studied this in some detail. Markers commonly used for genomic prediction are biallelic markers. Imagine four individuals and two loci, where alleles for the loci are \(\left\{ A,a \right\}\) and \(\left\{ B,b \right\}\). The genotypes of the four individuals are:

\[\begin{matrix} \text{aa} & \text{Bb} \\ \text{AA} & \text{bb} \\ \text{Aa} & \text{bb} \\ \text{aa} & \text{bb} \\ \end{matrix}\]

This can be coded with one effect by allele:

\[\mathbf{Za =}\begin{pmatrix} 0 & 2 & \vdots & 1 & 1 \\ 2 & 0 & \vdots & 0 & 2 \\ 1 & 1 & \vdots & 0 & 2 \\ 0 & 2 & \vdots & 0 & 2 \\ \end{pmatrix}\begin{pmatrix} a_{1A} \\ a_{1a} \\ \cdots\ \\ a_{2B} \\ a_{2b} \\ \end{pmatrix}\]

where \(a_{2B}\) is the effect of allele “B” at the 2nd locus. So, for \(n\) markers we have \(2n\) effects. Classic theory (e.g., Falconer and Mackay 1996) shows that this can be reduced to one effect per locus. We code in an additive way, as a regression of genetic value on gene content. The three classical ways of coding are:

Additive coding for marker effects at locus i with reference allele \(A\)
Genotype 101 Coding 012 Coding Centered coding
aa \(-a_{i}\) \(0\) \(- 2p_{i}a_{i}\)
Aa \(0\) \(a_{i}\) \((1 - 2p_{i})a_{i}\)
AA \(a_{i}\) \(2a_{i}\) \(\left( 2 - 2p_{i} \right)a_{i}\)

where \(p_{i}\) is the frequency of the reference allele (“A” in this case) at the i-th locus. In the example above, we have three possible \(\mathbf{Z}\) matrices:

101 coding: \(\mathbf{Za}=\begin{pmatrix} -1 & 0 \\ 1 & -1 \\ 0 & -1 \\ -1 & -1 \\ \end{pmatrix}\begin{pmatrix} a_{1} \\ a_{2} \\ \end{pmatrix}\)

012 coding: \(\mathbf{Za}=\begin{pmatrix} 0 & 1 \\ 2 & 0 \\ 1 & 0 \\ 0 & 0 \\ \end{pmatrix}\begin{pmatrix} a_{1} \\ a_{2} \\ \end{pmatrix}\)

centered coding: \(\mathbf{Za}\mathbf{=}\begin{pmatrix} -0.75 & 0.75 \\ 1.25 & - 0.25 \\ 0.25 & - 0.25 \\ -0.75 & - 0.25 \\ \end{pmatrix}\begin{pmatrix} a_{1} \\ a_{2} \\ \end{pmatrix}\)

for the “centered” coding, allelic frequencies were \(0.375\) and \(0.125\); it can be verified that each column of the centered \(\mathbf{Z}\) sums to 0. This will be true whenever allelic frequencies are computed from the observed data. VanRaden (P. M. VanRaden 2008) defined matrix \(\mathbf{M}\) as \(\mathbf{Z}\) with 101 coding and then \(\mathbf{Z}\mathbf{=}\mathbf{M}\mathbf{-}\mathbf{P}\), where \(\mathbf{P}\) is a matrix whose i-th column contains \(2(p_{i} - 0.5)\).

Which allele to pick as a reference is arbitrary. If the other allele is chosen (as in the next Table), then the numbers in \(\mathbf{Z}\) are reversed.

Additive coding for marker effects at locus i with reference allele \(a\).
Genotype 101 Coding 012 Coding Centered coding
aa \(a_{i}\) \(2a_{i}\) \(\left( 2 - 2p_{i} \right)a_{i}\)
Aa \(0\) \(a_{i}\) \((1 - 2p_{i})a_{i}\)
AA \(-a_{i}\) \(0\) \(- 2p_{i}a_{i}\)

As a result, estimates for marker effects \(a_{i}\) will change sign but the absolute value will be the same. Hence, \(\mathbf{u}=\mathbf{Za}\) will be the same regardless of the coding.

The following Julia code illustrates this

# test SNP-BLUP with different allele frequencies
using Random, Distributions, LinearAlgebra # Beta/Binomial/Normal need Distributions; I needs LinearAlgebra
nsnp=10
nanim=4
vary=10
h2=0.3
vare=(1-h2)*vary
varu=(h2)*vary


function SNP_BLUP(X,Z,y,vare,vara)
    RHS=[X Z]'*y/vare
    LHS=[X Z]'*[X Z]/vare
    #display(LHS)
    LHS[(1+1):(1+nsnp),(1+1):(1+nsnp)] += I/vara
    sol=LHS\RHS
    return sol
end

Random.seed!(1234)
p=rand(Beta(2,2),nsnp)
Z=zeros(nanim,nsnp)
for i in 1:nsnp
    for j in 1:nanim
        Z[j,i]=Float64.(rand(Binomial(2,p[i]),1))[1]
    end
end    

# sample y (not from an actual genetic theory - this is just
# white noise)

y=rand(Normal(0,vary),nanim)

# we use the same variance component for all SNP-BLUPs, computed with
# allele frequency 0.5 - neither the sampled nor the observed frequencies
vara=varu/(2*nsnp*0.5*0.5)

# compute observed frequencies (the result of sum is a row vector; we convert
# it to a column vector)
pobs=(sum(Z,dims=1)/(nanim*2))'

# use observed frequencies
unos=ones(nanim)
Zstar= Z - 2*unos*pobs'
# mean
X=unos

sol_obs=SNP_BLUP(X,Zstar,y,vare,vara)
display(sol_obs)

# use drawn frequencies
Zstar= Z - 2*unos*p'
sol_drawn=SNP_BLUP(X,Zstar,y,vare,vara)
display(sol_drawn)

display(isapprox(sol_obs[1+1:1+nsnp] , sol_drawn[1+1:1+nsnp]))

# use .5 frequencies
Zstar= Z .- 1
sol_05=SNP_BLUP(X,Zstar,y,vare,vara)
display(sol_05)

display(isapprox(sol_obs[1+1:1+nsnp] , sol_05[1+1:1+nsnp]))

# use stupid frequencies
pstupid=rand(Normal(0,1),nsnp)
Zstar= Z - 2*unos*pstupid'
sol_stupid=SNP_BLUP(X,Zstar,y,vare,vara)
display(sol_stupid)
display(isapprox(sol_obs[1+1:1+nsnp] , sol_stupid[1+1:1+nsnp]))

10.2 Effect of prior information on marker estimates

Bayesian regressions are affected by the prior distribution that we assign to marker effects. One of the concerns is to be “fair” about the prior distribution when making predictions. The problem is that a marker effect can be shrunken too much (so that its estimate is too small, for instance if there is a major gene) or too little, in which case the estimate of the marker contains too much error and is completely wrong. Consider one marker. We have likelihood information for this marker (its effect on the trait) and prior information from “outside”. What happens if this prior information is wrong?

The following two examples illustrate this. In both cases we estimate the marker effect as

\[lhs = \frac{\mathbf{1'1}}{\sigma_{e}^{2}} + \frac{1}{\sigma_{a}^{2}}\]

\[\widehat{a}\ = lhs^{- 1} \mathbf{1'y}/\sigma_{e}^{2}\]

10.2.1 Marker effect is fixed

Assume that we have \(10\) records, and the marker has a “true” effect of 0.2, and this effect is constant across replicates. For instance, DGAT1 is a known gene, and it is hard to think that its effect would change across different Holstein populations. We assume different prior variances for the marker, \(\sigma_{a}^{2} = \{ 0.01,\ 0.1,\ 1,10,100\}\), and \(\sigma_{e}^{2} = 1\). We have simulated 1000 data sets, and estimated the marker effect for each replicate; then plotted in the next Figure the error (as a boxplot) against the “no error” (in red), for each assumed marker variance.

It can be seen that when \(\sigma_{a}^{2}\) is “large” the estimator is unbiased (on average there is no error) but each individual estimate has very large error (for instance there are errors of 4). When some shrinkage is used (i.e., for \(\sigma_{a}^{2} = 1\)) the effect is slightly underestimated but large exaggerations never happen. Thus, across repetitions, the mean square error (blue stars) is minimized for small values of assumed \(\sigma_{a}^{2}\).
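This behavior can also be checked analytically rather than by simulation. With \(\sigma_{e}^{2} = 1\) and \(n\) records, the estimator above reduces to \(\widehat{a} = k\bar{y}\) with shrinkage factor \(k = n/(n + 1/\sigma_{a}^{2})\), so the mean square error decomposes as bias\(^{2}\) + variance. A sketch (n = 10 and true effect 0.2, as in the simulation):

```python
n, a, vare = 10, 0.2, 1.0   # records, true fixed effect, residual variance

def mse(vara):
    # shrinkage factor of the posterior-mean estimator for one marker
    k = n / (n + vare / vara)
    bias2 = ((1.0 - k) * a) ** 2    # systematic underestimation
    var = k * k * vare / n          # sampling variance of k * ybar
    return bias2 + var

for vara in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(vara, mse(vara))
# MSE is smallest for small assumed marker variances, as in the Figure
```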

10.2.2 Marker effect is random

In this case, the marker has different effects across populations because it is in weak LD with some QTL. Then the “true” effect of the marker may change all the time, because at each generation LD will be different. Thus, we can say that the marker effect is random and comes from some distribution. If the true variance of the marker effect is \(\sigma_{a}^{2} = 1\), we obtain the results on the bottom of the Figure. All methods are unbiased (there is no systematic error), but assuming the right variance gives the minimum error, as seen by the blue stars.

10.3 Genetic variance explained by markers

A population of \(n\) individuals has different breeding values \(u_{1}\ldots u_{n}\), with a certain genetic variance \(Var\left( u \right) = \sigma_{u}^{2}\). If markers are genes, which part of the genetic variance is explained by each marker? This is just basic quantitative genetics. If a marker has an effect of \(a_{i}\) for each copy of the \(A\) allele, we have \(p^{2}\) individuals with a value of \(u = 2a_{i}\), \(q^{2}\) individuals with a value of \(u = 0\), and \(2pq\) individuals with a value of \(u = a_{i}\). Then the variance explained by this marker is \(Var\left( u \right) = E\left( u^{2} \right) - E\left( u \right)^{2}\), which is developed in the following Table

Variance explained by one marker
Genotype Frequency \(u^{2}\) \(u\)
AA \(p^{2}\) \(4a_{i}^{2}\) \(2a_{i}\)
Aa \(2pq\) \(a_{i}^{2}\) \(a_{i}\)
aa \(q^{2}\) \(0\) \(0\)
Average \(4p^{2}a_{i}^{2} + 2pq a_{i}^{2}\) \(2p a_{i}\)

So, finally the variance explained by one marker is \(4p^{2}a_{i}^{2} + 2pq a_{i}^{2} - \left( 2p a_{i} \right)^{2} = 2pq a_{i}^{2}\). Markers with intermediate frequencies will explain most genetic variation. This is one of the reasons to ignore markers with low allele frequency.
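This derivation can be checked numerically by enumerating the three genotypes (p and a are arbitrary example values):

```python
# Variance explained by one biallelic marker, by direct enumeration.
p, a = 0.3, 0.5
q = 1.0 - p
freqs  = {"AA": p*p, "Aa": 2*p*q, "aa": q*q}   # Hardy-Weinberg frequencies
values = {"AA": 2*a, "Aa": a,     "aa": 0.0}   # additive genotypic values

Eu  = sum(freqs[g] * values[g]    for g in freqs)
Eu2 = sum(freqs[g] * values[g]**2 for g in freqs)
var_marker = Eu2 - Eu**2
print(var_marker, 2*p*q*a**2)   # both equal 0.105
```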

10.3.1 Total genetic variance explained by markers

These are classic results as well. Consider two markers, and assume that we know their effects \(a_{i}\). The genetic value of an individual with genotypes \(\mathbf{z}\) will be \(u = z_{1}a_{1} + z_{2}a_{2}\). Variance in the population comes from sampling of genotypes (i.e., some individuals have one genotype while others have another). Then \(Var(u)=Var(z_{1})a_{1}^{2} + Var(z_{2})a_{2}^{2} + 2Cov(z_{1},z_{2})a_{1}a_{2}\). The term \(Var\left( z_{1} \right) = 2p_{1}q_{1}\). The term \(Cov\left( z_{1},z_{2} \right)\) turns out to be \(Cov\left( z_{1},z_{2} \right) = 2r\sqrt{p_{1}q_{1}p_{2}q_{2}}\), where \(r\) is the correlation measuring linkage disequilibrium. The term \(a_{1}a_{2}\) implies that marker effects go in the same direction. Therefore, for the covariance between loci to enter the genetic variance, the two markers need to be in linkage disequilibrium and, at the same time, their effects need to point in the same direction. Although the Bulmer effect generates linkage disequilibrium in selected populations, we will ignore it here; in this case, on average this term will typically cancel out.

Either assuming linkage equilibrium or assuming that markers are uncorrelated one to each other, then, \(Var\left( u \right) = Var\left( z_{1} \right)a_{1}^{2} + Var\left( z_{2} \right)a_{2}^{2} = 2p_{1}q_{1}a_{1}^{2} + 2p_{2}q_{2}a_{2}^{2}\), and variances of each marker can simply be added. If we generalize this result to many markers, we have that

\[\sigma_{u}^{2} = Var\left( u \right) = 2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}a_{i}^{2}}\]

However, in most cases we do not know the marker effects. We may, though, have some prior information on them, like their a priori variance (the a priori mean is usually taken as zero). If this is the case, then we can substitute the term \(a_{i}^{2}\) by its a priori expectation, that is, \(\sigma_{\text{ai}}^{2}\) and therefore: \(\sigma_{u}^{2} = Var(u) = 2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}\sigma_{\text{ai}}^{2}}\).

If we assume that all markers have the same variance a priori, \(\sigma_{a0}^{2}\) (say \(\sigma_{a1}^{2} = \sigma_{a2}^{2} = \sigma_{a3}^{2} = \ldots = \sigma_{a0}^{2}\)), then \(\sigma_{u}^{2} = 2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}\sigma_{a0}^{2}} = 2\sigma_{a0}^{2}\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}\). We can factor out \(\sigma_{a0}^{2}\) and we have the famous identity (Daniel Gianola et al. 2009; R. L. Fernando et al. 2007; P. M. VanRaden 2008; D. Habier, Fernando, and Dekkers 2007):

\[\sigma_{a0}^{2} = \frac{\sigma_{u}^{2}}{2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}}\]

This puts the a priori variance of the markers as a function of the genetic variance of the population. This result is used over and over in these notes and in most applications in genomic prediction.
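A minimal sketch of this computation (the allele frequencies are made up for illustration):

```python
# A priori marker variance from the genetic variance:
# sigma_a0^2 = sigma_u^2 / (2 * sum(p_i * q_i)).

def marker_prior_variance(sigma2_u, freqs):
    """Common a priori variance of marker effects given allele frequencies."""
    het = 2.0 * sum(p * (1.0 - p) for p in freqs)
    return sigma2_u / het

freqs = [0.5, 0.3, 0.1, 0.4]   # hypothetical allele frequencies
print(marker_prior_variance(1.0, freqs))
```

With these four frequencies, \(2\sum p_{i}q_{i} = 1.58\), so a genetic variance of 1 translates into a per-marker prior variance of about 0.63; with 50,000 markers the per-marker variance would be tiny, as expected.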

10.3.2 Genetic variance explained by markers after fitting the data

This is actually fairly simple. After fitting the model to the data, there is an estimate \(\widehat{a}\) for each marker. We may say that each marker \(i\) explains a variance \(2p_{i}q_{i}{\widehat{a}}_{i}^{2}\). Therefore, and contrary to common assertions, the genetic variance contributed by each marker is NOT the same across all markers, and this is true for any method. Also, note that \(2\sum p_{i}q_{i}{\widehat{a}}_{i}^{2}\) underestimates the total genetic variance, because estimates \({\widehat{a}}_{i}\) are shrunken towards 0. Better estimators will be presented later in, among others, GREML and BayesC.

10.4 Prior distributions for marker effects

From the previous sections, it is clear that shrinking or, in other words, using prior distributions for markers is a good idea. Therefore, we need a prior distribution for marker effects, which is notoriously difficult to conceive. Complexity comes, first, because markers are not genes per se; rather, they tag genes. But even the distribution of gene effects is unknown. There is a growing consensus that most complex traits are highly polygenic, with hundreds to thousands of causal genes, most frequently of small effect. So, the prior distribution must include many small and few large effects. Also, for practical reasons, markers are assumed to be uncorrelated – even if they are close. For instance, if two markers are in strong linkage disequilibrium, they will likely show a similar effect after fitting the model, because they will have similar incidence columns in \(\mathbf{Z}\). But before fitting the model, we cannot say whether their effects will be similar or not. This is accentuated by the arbitrariness in defining the sense of the coding: naming “A” or “a” the reference allele will change the sign of the marker effect.

Many priors for marker effects have been proposed in recent years. These priors come more from practical (ease of computation) than from biological reasons. Each prior gives rise to a method or family of methods, and we will describe them next, as well as their implications.

  1. Normal distribution: Random regression BLUP (RR-BLUP), SNP-BLUP, GBLUP

  2. Normal distribution with unknown variances: BayesC, GREML, GGibbs

  3. Student (t) distribution : BayesA

  4. Mixture of Student (t) distribution and spike at 0: BayesB

  5. Mixture of Normal distribution and spike at 0: BayesCPi

  6. Double exponential: Bayesian Lasso

  7. Mixture of a large and small normal distribution: Stochastic Search Variable Selection (SSVS)

10.5 RR-BLUP or SNP-BLUP

In these notes, I will keep the name GBLUP for the model using genomic relationship matrices that will appear later, and the name SNP-BLUP for estimating marker effects.

The SNP-BLUP model for the phenotypes is typically something like:

\[\mathbf{y = Xb + Za + e}\]

with \(\mathbf{b}\) fixed effects (e.g., an overall mean), \(\mathbf{a}\) marker effects, and \(\mathbf{e}\) residual terms, with \(Var\left( \mathbf{e} \right) = \mathbf{R}\) and usually \(\mathbf{R}\mathbf{=}\mathbf{I}\sigma_{e}^{2}\). Matrix \(\mathbf{Z}\) contains genotypes coded in any of the forms described previously (usually centered, \(\{0,1,2\}\) or \(\{-1,0,1\}\)).

The prior for markers can be written as:

\[p\left( \mathbf{a} \right) = \prod_{i = 1}^{\text{nsnp}}{p(a_{i})}\]

where

\[p\left( a_{i} \right) = N\left( 0,\sigma_{a0}^{2} \right)\]

each marker effect follows a priori a normal distribution with a variance \(\sigma_{a0}^{2}\) (that we will term hereinafter “variance of marker effects”). Note that the “0” implies that this variance is constant across markers.

Standard normal distribution

From the Figure, note that in a normal distribution most effects are concentrated around 0, whereas few effects will be larger than, say, a value of 3. Therefore, the prior assumption of normality precludes markers from having very large effects – unless there is a lot of information to compensate for this prior assumption.

We assume that markers are independent one from each other. This can be equivalently written as:

\[p(\mathbf{a}) = MVN(\mathbf{0,D});\ Var(\mathbf{a}) = \mathbf{D} = \mathbf{I}\sigma_{a0}^{2}\]

where MVN stands for multivariate normal. This formulation including \(\mathbf{D}\) will be used again throughout these notes.

10.5.1 Mixed Model equations for SNP-BLUP

10.5.1.1 Single trait

The great advantage of the normal distribution is its algebraic easiness. Whereas in most cases marker effects are estimated using Gibbs Sampling, as we will see later on, there are closed formulae for estimators of marker effects. We can use Henderson’s Mixed Model Equations:

\[\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{D}^{-1} \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{a}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \\ \end{pmatrix}\]

Note that this is a linear estimator. If \(Var(\mathbf{a}) = \mathbf{D} = \mathbf{I}\sigma_{a0}^{2}\) and \(Var\left( \mathbf{e} \right) = \mathbf{R} = \mathbf{I}\sigma_{e}^{2}\), then we can simplify them to

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{Z} \\ \mathbf{Z}^{\mathbf{'}}\mathbf{X} & \mathbf{Z}^{\mathbf{'}}\mathbf{Z + I\lambda} \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{a}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{y} \\ \mathbf{Z}^{\mathbf{'}}\mathbf{y} \\ \end{pmatrix}\]

with \(\lambda = \sigma_{e}^{2}/\sigma_{a0}^{2}\). This expression is also known as Ridge Regression, although the Ridge Regression literature presents \(\mathbf{I}\lambda\) (or \(\mathbf{D}\)) merely as a computational device to stabilize the estimates, whereas the genetics literature presents \(\lambda\) as the ratio of residual to marker variance. (We do not like the name Ridge Regression for this reason.) Following traditional notation, we will talk about \(lhs\) (left hand side of the equations) and \(rhs\) (right hand side): \(lhs\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{a}} \\ \end{pmatrix} = rhs\).

These equations have unusual features compared to regular ones. First, their dimension is \(\left( \text{number of fixed effects} + \text{number of markers} \right)^{2}\) and does not depend on the number of animals. Second, they are hardly sparse: matrix \(\mathbf{Z}'\mathbf{Z}\) is completely dense.

For instance, assume \(\mathbf{Za}\mathbf{=}\begin{pmatrix} - 1 & 0 \\ 1 & - 1 \\ 0 & - 1 \\ - 1 & - 1 \\ \end{pmatrix}\begin{pmatrix} a_{1} \\ a_{2} \\ \end{pmatrix}\) (four individuals and two markers), an overall mean and \(\lambda = 0.5\). Then

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{Z} \\ \mathbf{Z}^{\mathbf{'}}\mathbf{X} & \mathbf{Z}^{\mathbf{'}}\mathbf{Z + I}\lambda \\ \end{pmatrix} = \begin{pmatrix} 4 & - 1 & - 3 \\ - 1 & 3 + 0.5 & 0 \\ - 3 & 0 & 3 + 0.5 \\ \end{pmatrix}\]
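As a check, the small system above can be reproduced numerically; a quick sketch with NumPy (the overall mean is left unpenalized):

```python
import numpy as np

# Left-hand side of the SNP-BLUP equations for four individuals, an
# overall mean, two markers coded {-1, 0, 1}, and lambda = 0.5.
Z = np.array([[-1, 0], [1, -1], [0, -1], [-1, -1]], dtype=float)
X = np.ones((4, 1))                       # overall mean
lam = 0.5
W = np.hstack([X, Z])
lhs = W.T @ W + np.diag([0.0, lam, lam])  # no shrinkage on the mean
print(lhs)
```

This recovers the matrix shown above: 4 on the top-left corner, \(-1\) and \(-3\) as cross-products of the mean with the markers, and \(3 + 0.5\) on the marker diagonal.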

10.5.1.2 Multiple trait

For a multiple trait model, the equations are as above:

\[\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{D}^{-1} \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{a}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \\ \end{pmatrix}\]

but \(\mathbf{R}\) and \(\mathbf{D}\) include multiple trait covariances, e.g. \(\mathbf{R}\mathbf{=}\mathbf{I} \otimes \mathbf{R}_{0}\) and \(\mathbf{D} = \mathbf{I} \otimes \mathbf{S}_{a0}\).

10.5.2 How to set variance components in BLUP-SNP

Henderson’s equations assume that you know the values of two variance components: the variance of marker effects (\(\sigma_{a0}^{2}\)) and the residual variance (\(\sigma_{e}^{2}\)). There are two possible strategies. The most common one is to use the relationship between the genetic variance and the a priori marker variance:

\[\sigma_{a0}^{2} = \frac{\sigma_{u}^{2}}{2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}}\]

where \(\sigma_{u}^{2}\) is an estimate of the genetic variance (e.g., obtained from previous pedigree-based studies) and \(p\) are marker allele frequencies (\(q = 1 - p\)). These allele frequencies should be the ones of the population where the genetic variance was estimated (e.g., the base population of the pedigree), not those of the current, observed population. However, \(p\) is usually obtained from the data, so there is some error (although often negligible); we will come back to this later. Also, sometimes the genetic variance used in the “large” (national) genetic evaluations does not match well the genetic variance existing in the genotyped population. Still, the equation above is usually a good guess.

As for the residual variance, it can be taken as well from previous studies.

For the multiple trait case, \(\mathbf{S}_{a0} = \mathbf{G}_{0}/2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}\) where \(\mathbf{G}_{0}\) is a matrix with estimates of the genetic covariances across traits.

10.5.3 Solving for marker effects

Mixed model equations as above can be explicitly set up and solved, but this is expensive. For instance, setting up the equations has a cost of order \(mn^{2}\) (for \(n\) markers and \(m\) individuals), and inverting them of order \(n^{3}\). Alternative strategies exist (A. Legarra and Misztal 2008; P. M. VanRaden 2008; I. Strandén and Garrick 2009). They involve working with the genotype matrix \(\mathbf{Z}\) without setting up the mixed model equations explicitly. This can be done using iterative solving, where new solutions are based on old ones and improve as iteration proceeds, until convergence. Two such procedures are Gauss Seidel and the Preconditioned Conjugate Gradients algorithm (PCG). These were explained in detail by (A. Legarra and Misztal 2008).

Gauss Seidel proceeds by solving for each unknown pretending that the other ones are known. So, if we deal with the \(i\)-th marker at iteration \(l + 1\), the mixed model equations for that marker reduce to a single equation:

\[\left( \mathbf{z}_{i}^{'}\mathbf{z}_{i}\ + \ \lambda \right){\widehat{a}}_{i}^{l + 1}\ = \ \mathbf{z}_{i}^{'}\ (\mathbf{y} - \mathbf{X}\widehat{\mathbf{b}}\mathbf{- Z}\widehat{\mathbf{a}} + \mathbf{z}_{i}{\widehat{a}}_{i}^{l})\]

This needs \(n\) operations for each marker, with a total of \(n^{2}\) operations for each complete round of the Gauss Seidel (e.g., \(50000^{2}\) for a 50K chip). However, it is easy to realize that the term within the parenthesis is the residual term “so far”, \({\widehat{\mathbf{e}}}^{l}\):

\[\left( \mathbf{z}_{i}^{'}\mathbf{z}_{i}\ + \ \lambda \right){\widehat{a}}_{i}^{l + 1}\ = \ \mathbf{z}_{i}^{'}\ \left( {\widehat{\mathbf{e}}}^{l} + \mathbf{z}_{i}{\widehat{a}}_{i}^{l} \right) = \mathbf{z}_{i}^{'}{\widehat{\mathbf{e}}}^{l}\ + \mathbf{z}_{i}^{\mathbf{'}}\mathbf{z}_{i}{\widehat{a}}_{i}^{l}\]

So the operation can be changed to a simpler one with a cost of \(m\) per marker. The residual term needs to be corrected after every new solution of the marker effect, using

\[{\widehat{\mathbf{e}}}^{l + 1}\ = {\widehat{\mathbf{e}}}^{l}\ -\ \mathbf{z}_{i}\left( {\widehat{a}}_{i}^{l + 1} - {\widehat{a}}_{i}^{l} \right)\]

with a cost of \(m\) (number of records) for each marker, and \(mn\) for a complete iteration. This strategy is called Gauss Seidel with Residual Update. Pseudo-code in Fortran follows; a working code in R is in the Appendix:

Double precision:: xpx(neq),y(ndata),e(ndata),X(ndata,neq), &
sol(neq),lambda,lhs,rhs,val
do i=1,neq
    xpx(i)=dot_product(X(:,i),X(:,i)) !form diagonal of X'X
enddo
e=y
do until convergence
    do i=1,neq
        !form lhs X'R-1X + G-1
        lhs=xpx(i)/vare+1/vara
        ! form rhs with y corrected by other effects (formula 1) !X'R-1y
        rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
        ! do Gauss Seidel
        val=rhs/lhs
        ! MCMC sample solution from its conditional (commented out here)
        ! val=normal(rhs/lhs,1d0/lhs)
        ! update e with current estimate (formula 2)
        e=e - X(:,i)*(val-sol(i))
        !update sol
        sol(i)=val
    enddo
enddo
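The GSRU scheme above can be sketched as a short, self-contained routine; this is an illustrative NumPy translation of the pseudo-code (markers only, no fixed effects, toy data), checked against a direct solve:

```python
import numpy as np

def gsru(Z, y, lam, n_sweeps=500):
    """Gauss-Seidel with Residual Update for (Z'Z + I*lam) a = Z'y.

    Never forms Z'Z: only its diagonal (zpz) and a running residual e.
    """
    m, n = Z.shape
    a = np.zeros(n)
    e = y.astype(float).copy()            # residual "so far" (all effects at 0)
    zpz = np.einsum('ij,ij->j', Z, Z)     # diagonal of Z'Z, computed once
    for _ in range(n_sweeps):
        for i in range(n):
            lhs = zpz[i] + lam
            rhs = Z[:, i] @ e + zpz[i] * a[i]
            new = rhs / lhs
            e -= Z[:, i] * (new - a[i])   # residual update, cost O(m)
            a[i] = new
    return a

rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(20, 5)) - 1.0   # toy {-1,0,1} genotypes
y = rng.normal(size=20)
a_gsru = gsru(Z, y, lam=2.0)
a_direct = np.linalg.solve(Z.T @ Z + 2.0 * np.eye(5), Z.T @ y)
```

Each sweep costs \(O(mn)\): a dot product and a residual update per marker, never a full \(n \times n\) system.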

PCG is a strategy that uses a generic solver and proceeds by successive computations of the product \(\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{Z} \\ \mathbf{Z}^{\mathbf{'}}\mathbf{X} & \mathbf{Z}^{\mathbf{'}}\mathbf{Z}\mathbf{+}\mathbf{I}\lambda\ \\ \end{pmatrix}\begin{pmatrix} {\widehat{\mathbf{b}}}^{l} \\ {\widehat{\mathbf{a}}}^{l} \\ \end{pmatrix}\). This can be easily done in two steps as

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{Z} \\ \mathbf{Z}^{\mathbf{'}}\mathbf{X} & \mathbf{Z}^{\mathbf{'}}\mathbf{Z + I}\lambda \\ \end{pmatrix}\begin{pmatrix} {\widehat{\mathbf{b}}}^{l} \\ {\widehat{\mathbf{a}}}^{l} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{\mathbf{'}} \\ \mathbf{Z}^{\mathbf{'}} \\ \end{pmatrix}\left( \begin{pmatrix} \mathbf{X} & \mathbf{Z} \\ \end{pmatrix}\begin{pmatrix} {\widehat{\mathbf{b}}}^{l} \\ {\widehat{\mathbf{a}}}^{l} \\ \end{pmatrix} \right) + \begin{pmatrix} \mathbf{0} \\ \mathbf{I}\lambda{\widehat{\mathbf{a}}}^{l} \\ \end{pmatrix}\]

Again, only matrix \(\mathbf{Z}\) is used but its cross-product \(\mathbf{Z}^{\mathbf{'}}\mathbf{Z}\) is never computed.
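The two-step product can be verified numerically; a small sketch (hypothetical dimensions, toy genotypes) showing that \(\mathbf{W}'(\mathbf{W}v)\) plus the shrinkage term equals the explicit product without ever forming \(\mathbf{Z}'\mathbf{Z}\):

```python
import numpy as np

# The PCG matrix-vector product in two steps. W = [X Z]; the product of
# the coefficient matrix with v = (b, a) is W'(W v) plus (0, lam*a).
rng = np.random.default_rng(2)
X = np.ones((50, 1))
Z = rng.integers(0, 3, size=(50, 200)) - 1.0
lam = 3.0
v = rng.normal(size=201)                  # current (b, a) iterate

W = np.hstack([X, Z])
two_step = W.T @ (W @ v)                  # cost O(m*n); no n*n matrix
two_step[1:] += lam * v[1:]               # add I*lam on the marker block

explicit = (W.T @ W + np.diag([0.0] + [lam] * 200)) @ v
```

The matvec is all PCG needs per iteration, so memory stays at the size of \(\mathbf{Z}\) even for very large marker panels.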

Benefits of GSRU and PCG depend on the number of markers, but for large numbers they are extremely fast. For instance, a Fortran code with PCG can solve for three thousand records and one million markers in minutes. PCG has (much) faster convergence than GSRU: see the graphs below. This makes it attractive for large applications. However, GSRU can be converted with very few changes into a Gibbs sampler.

10.6 Estimating variances from marker models: BayesC with Pi=0

Often, estimates of variance components from field data are unreliable, too old, or not directly available. In this case, it is simpler to estimate those variances from marker data. Although this is typically done using GREML, it can also be done in marker models. This was done by (Andrés Legarra et al. 2008) in mice, and it has later been used to estimate genetic variances in wild populations (Sillanpaa 2011). It is very simple to do using Bayesian inference, and posterior estimates of the variances \(\sigma_{a0}^{2}\) and \(\sigma_{e}^{2}\) are obtained. One such program is GS3 (A. Legarra, Ricardi, and Filangi 2011). This method was described by (David Habier et al. 2011) as BayesC with Pi=0 and that is how we will cite it.

The algorithm is fairly simple from a GSRU iteration scheme. Instead of iterating the solution, we sample it, then we sample the marker variance:

do j=1,niter
    do i=1,neq
        !form lhs
        lhs=xpx(i)/vare+1/vara
        ! form rhs with y corrected by other effects (formula 1)
        rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
        ! MCMC sample solution from its conditional
        val=normal(rhs/lhs,1d0/lhs)
        ! update e with current estimate (formula 2)
        e=e - X(:,i)*(val-sol(i))
        !update sol
        sol(i)=val
    enddo
  ! draw variance components
    ss=sum(sol**2)+ Sa
    vara=ss/chi(nua+nsnp)
    ss=sum(e**2)+ Se
    vare=ss/chi(nue+ndata)
enddo
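The sampler above can be sketched in a few lines; a minimal illustrative translation on simulated data (all values hypothetical; scales follow the \(S = \sigma^{2}\nu\) parameterization used in these notes):

```python
import numpy as np

# Minimal Gibbs sampler in the spirit of BayesC with Pi=0: sample each
# marker effect from its conditional normal, then the marker and
# residual variances from scaled inverse chi-squares. Toy data only.
rng = np.random.default_rng(3)
m, n = 200, 20
Z = rng.integers(0, 3, size=(m, n)) - 1.0
a_true = rng.normal(0.0, 0.5, size=n)
y = Z @ a_true + rng.normal(size=m)

nua, nue = 4.0, 4.0              # prior degrees of freedom
Sa, Se = 0.25 * nua, 1.0 * nue   # scales: S = sigma2 * nu
vara, vare = 0.25, 1.0           # starting values
sol = np.zeros(n)
e = y.copy()
zpz = np.einsum('ij,ij->j', Z, Z)

for it in range(200):
    for i in range(n):
        lhs = zpz[i] / vare + 1.0 / vara
        rhs = Z[:, i] @ e / vare + zpz[i] * sol[i] / vare
        val = rng.normal(rhs / lhs, np.sqrt(1.0 / lhs))  # conditional draw
        e -= Z[:, i] * (val - sol[i])                    # residual update
        sol[i] = val
    # draw variance components from scaled inverse chi-squares
    vara = (np.sum(sol**2) + Sa) / rng.chisquare(nua + n)
    vare = (np.sum(e**2) + Se) / rng.chisquare(nue + m)
```

In a real analysis one would store the chain and report posterior means after burn-in; here the point is only that the sampler is the GSRU loop with draws instead of updates.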

The algorithm requires initial values of the variances and also prior information for them. Typical prior distributions for variance components are inverse chi-squared (\(\chi^{- 2}\)) distributions scaled by constants (\(S_{a}^{2}\) and \(S_{e}^{2}\) for marker and residual variances) with some degrees of freedom (\(\nu_{a}\) and \(\nu_{e}\)). The degrees of freedom represent the amount of information put on those variances: 4 is a small, almost “irrelevant” value, and typical in practice, whereas 10,000 is a very strong prior. On expectation, if we use a priori \(S_{e}^{2}\) and \(\nu_{e}\), then \(E\left( \sigma_{e}^{2} \middle| S_{e},\nu_{e} \right) = S_{e}^{2}/\nu_{e}\). One may use previous estimates and therefore set

\[S_{e}^{2} = \sigma_{e}^{2}\nu_{e}\]

\[S_{a}^{2} = \sigma_{a0}^{2}\nu_{a};\ \sigma_{a0}^{2} = \frac{\sigma_{u}^{2}}{2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}}\]

NOTE: In other parameterizations \(E\left( \sigma_{e}^{2} \middle| S_{e},\nu_{e} \right) = S_{e}^{2}\) and \(E\left( \sigma_{a}^{2} \middle| S_{a},\nu_{a} \right) = S_{a}^{2}\), so that the scale factor is on the same scale as the regular variances, and we can use \(S_{e}^{2} = \sigma_{e}^{2}\) and \(S_{a}^{2} = \sigma_{a0}^{2}\). This is the case for GS3 and the blupf90 family (Ignacio Aguilar et al. 2018).

This is equivalent to what will be discussed in next chapter about GREML and G-Gibbs.

10.7 Transforming marker variance into genetic variance

We can use the previous result to get the genetic variance from the marker variance:

\[\sigma_{u}^{2} = 2\sigma_{a0}^{2} \sum_{i}^{nsnp}{p_{i}q_{i}}\]

This is ONE estimate of the genetic variance. It does not necessarily agree with other estimates, for several reasons: different genetic base, different genetic model, and different data sets. However, published papers in livestock genetics do NOT show much missing heritability – estimates of genetic variance with pedigree or markers usually agree up to, say, a 10% difference.

10.7.1 Example with mice data

An example is interesting here. The mice data set of (Andrés Legarra et al. 2008) produced estimates of genetic variance based on pedigree and of marker variance based on markers, which are summarized in the following table. The column \(\sigma_{u}^{2}\) – markers is obtained multiplying \(\sigma_{a0}^{2}\) by \(2\sum p_{i}q_{i}\)=3782.05.

Variance components in mice data

Trait                                  \(\sigma_u^2\) - pedigree   \(\sigma_{a0}^2\)          \(\sigma_u^2\) - markers
Weight                                 4.59                        \(3.52 \times 10^{-4}\)    1.33
Growth slope (times \(10^{-4}\))       8.37                        \(1.04 \times 10^{-3}\)    3.93
Body length                            0.040                       \(9.09 \times 10^{-6}\)    0.034
Body Mass Index (times \(10^{-4}\))    2.49                        \(0.80 \times 10^{-3}\)    3.02
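The last column can be checked by direct multiplication, using the value \(2\sum p_{i}q_{i} = 3782.05\) reported for this data set:

```python
# Convert marker variance to genetic variance: sigma_u^2 = sigma_a0^2 * 2*sum(p*q).
het = 3782.05  # 2*sum(p_i*q_i) in the mice data

for trait, va0 in [("Weight", 3.52e-4), ("Growth slope", 1.04e-3),
                   ("Body length", 9.09e-6), ("Body Mass Index", 0.80e-3)]:
    print(trait, round(va0 * het, 3))
```

The products reproduce the last column of the table (1.33, 3.93, 0.034, 3.02) up to rounding.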

Results are sometimes different; why? One reason is that pedigree-based estimates in this particular data set are unreliable, because cage and family are confounded. Markers provide more accurate estimates. Another reason is that the genetic variances estimated with pedigree or with markers refer to two slightly different populations. The genetic variance estimated with markers refers to an ideal population in Hardy-Weinberg equilibrium and with certain allele frequencies; these are the hypotheses underlying the expression \(\sigma_{u}^{2} = \sigma_{a0}^{2}\ 2\sum p_{i}q_{i}\). The genetic variance estimated with pedigree refers to an ideal population in which the founders of the pedigree are unrelated. The fact that we refer to two different ideal populations is described as different genetic bases (P. M. VanRaden 2008; B. J. Hayes, Visscher, and Goddard 2009). There are essentially two methods to compare estimates from two different bases, presented by (Andres Legarra 2016; Lehermeier et al. 2017), although they refer more to a GBLUP framework.

It can be shown that if we have a pedigreed population and markers for this population, on expectation both variances are identical in Hardy-Weinberg and absence of inbreeding. We will come back to this notion later on the chapter on GBLUP and genomic relationships, and we will see how to deal with it.

10.8 Differential variances for markers

Real data show the presence of large QTLs (or major genes, if you prefer) across the genome. We have seen before that shrinking markers results in smaller estimates than their “true” value. On the other hand, this avoids too much error in estimation. So how can one proceed? One way is to assign shrinkage differentially. Let’s look at the equation for one marker effect:

\[{\widehat{a}}_{i} = \frac{\frac{\mathbf{z}_{i}'\widetilde{\mathbf{y}}}{\sigma_{e}^{2}}} {\frac{\mathbf{z}_{i}'\mathbf{z}_{i}}{\sigma_{e}^{2}} + \frac{1}{\sigma_{\text{ai}}^{2}}}\]

where \(\widetilde{\mathbf{y}}\) means “\(\mathbf{y}\) corrected by all other effects” and \(\sigma_{\text{ai}}^{2}\) is the shrinkage of that marker. In BLUP-SNP, we assume \(\sigma_{\text{ai}}^{2} = \sigma_{a0}^{2}\), constant across all markers.

It would be nice to progressively update \(\sigma_{\text{ai}}^{2}\) in order to get better estimates; intuitively, the larger \({\widehat{a}}_{i}\), the larger \(\sigma_{\text{ai}}^{2}\). However, this cannot be done easily: giving too much (or too little) value to \(\sigma_{\text{ai}}^{2}\) results in bad estimates of \(a_{i}\), which in turn give bad estimates of \(\sigma_{\text{ai}}^{2}\), simply because the variance of one marker is predicted from the estimate of a single marker effect.

10.8.1 REML formula for estimation of single marker variances

From old REML literature (e.g., see Ignacy Misztal notes), the EM formula for marker estimation should be:

\[{\widehat{\sigma}}_{\text{ai}}^{2} = {\widehat{a}}_{i}^{2} + C^{\text{ii}}\]

where \(C^{\text{ii}}\) is the element corresponding to the \(i\)-th marker in the inverse of the coefficient matrix of the Mixed Model Equations \(\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{D}^{-1} \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{a}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \\ \end{pmatrix}\).

This expression has two parts. The first, \({\widehat{a}}_{i}^{2}\), is the squared marker estimate. However, this estimate is heavily shrunken (i.e., if the true effect of the marker is 7, the estimate may be 0.3), and the second part, \(C^{\text{ii}}\), compensates for this lack of information; it is known as the missing information. This estimate can be obtained in a GBLUP context (Shen et al. 2013). However, the formula is of dubious validity with a single effect per variance, and even if it were valid, the estimate is very inaccurate, because only one marker effect is available to estimate each variance component.

10.8.2 Bayesian estimation of marker variances

Marker “variances” can, however, be included within a Bayesian framework. The Bayesian framework postulates a non-normal distribution for marker effects, and this non-normal distribution can be expressed as a two-stage (or hierarchical) distribution. In the first stage, we postulate that each marker has a priori a different variance:

\[p\left( a_{i} \middle| \sigma_{\text{ai}}^{2} \right) = N\left(0,\sigma_{\text{ai}}^{2} \right)\]

In the second stage, we postulate a prior distribution for the variance themselves:

\[p\left( \sigma_{\text{ai}}^{2} \middle| \text{something} \right) = p(\ldots)\]

This prior distribution helps (the estimate of \(\sigma_{\text{ai}}^{2}\) is more accurate, in the sense of lower mean square error), although it will still be far from reality (e.g., (Daniel Gianola et al. 2009)). At any rate, this way of working is very convenient because the solving algorithm simplifies greatly. Most Bayesian regressions are based on this idea.

10.9 BayesA

The simplest idea is to assume that a priori we have some information on the marker variance. For instance, this can be \(\sigma_{a0}^{2}\). Thus, we may attach some importance to this value and use it as prior information for \(\sigma_{\text{ai}}^{2}\). A natural way of doing this is using an inverse chi-squared distribution with scale \(S_{a}^{2} = \sigma_{a0}^{2}\nu_{a}\) and \(\nu_{a}\) degrees of freedom:

\[p\left( a_{i} \middle| \sigma_{\text{ai}}^{2} \right) = N\left( 0,\sigma_{\text{ai}}^{2} \right)\]

\[p\left( \sigma_{\text{ai}}^{2} \middle| S_{a},\nu_{a} \right) = S_{a}\chi_{\nu_{a}}^{- 2}\]

The value of \(\sigma_{a0}^{2}\) should actually be set as

\[\sigma_{a0}^{2} = \frac{\nu - 2}{\nu}\frac{\sigma_{u}^{2}}{2\sum p_{i}q_{i}}\]

Because the variance of a t distribution is \(\nu/(\nu - 2)\).

The whole setting is known as BayesA (T. H. E. Meuwissen, Hayes, and Goddard 2001). It can be shown that this corresponds to a prior on the marker effects corresponding to a scaled \(t\) distribution (Daniel Gianola et al. 2009):

\[p\left( a_{i} \middle| \sigma_{a0}^{2},\nu_{a} \right) = \sigma_{a0}t\left( 0,\nu_{a} \right)\]

which has the property of having “fat tails”. This means that large marker effects are not unlikely a priori. For instance, having an effect of 4 is 200 times more likely under BayesA with \(\nu_{a} = 4\) than BLUP-SNP. This can be seen in the Figure below.
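As a rough check of this fat-tail behaviour, one can compare the density of a large effect under a unit-variance normal prior and under a unit-variance scaled \(t\) with \(\nu_{a} = 4\); the exact factor depends on the scaling convention, so this sketch is only qualitative:

```python
import math

# Density ratio scaled-t / normal, both standardized to unit variance,
# at a small and at a large effect size.

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t_pdf(x, nu):
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi))
    return c * (1.0 + x * x / nu) ** (-(nu + 1) / 2)

def scaled_t_pdf(x, nu):
    s = math.sqrt((nu - 2.0) / nu)   # scale so the variance is 1
    return t_pdf(x / s, nu) / s

for x in (1.0, 4.0):
    print(x, scaled_t_pdf(x, 4.0) / normal_pdf(x))
```

Near the center the \(t\) density is slightly lower than the normal, but at an effect of 4 standard deviations it is an order of magnitude higher: large effects are far less penalized a priori.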

Choosing \(\nu_{a}\) is not obvious, although small values around 4 are suggested in the literature. High values give the same results as a normal distribution and thus as BLUP-SNP. The code for BayesA is very simple:

do j=1,niter
    do i=1,neq
        !form lhs
        lhs=xpx(i)/vare+1/vara(i)
        ! form rhs with y corrected by other effects
        rhs=dot_product(X(:,i),e)/vare +xpx(i) *sol(i)/vare
        ! MCMC sample solution from its conditional
        val=normal(rhs/lhs,1d0/lhs)
        ! update e with current estimate (formula 2)
        e=e - X(:,i)*(val-sol(i))
        !update sol
        sol(i)=val
      ! draw variance components for markers
      ss=sol(i)**2+nua*Sa
      vara(i)=ss/chi(nua+1)
    enddo
   ! draw variance components for residual
    ss=sum(e**2)+nue*Se
    vare=ss/chi(nue+ndata)
enddo

10.10 BayesB

A very common thought at the beginning of genomic evaluation was that there were not many QTLs. So a natural idea is to consider that many markers have no effect because there is no QTL to trace. This originated the method known as BayesB, which simply states that the individual marker variance \(\sigma_{\text{ai}}^{2}\) is potentially zero, and that this can be found out. Note that this cannot happen in BayesA: the a priori chi-squared distribution prevents any marker variance from being zero.

This idea corresponds to a more complex prior as follows:

\[p\left( a_{i} \middle| \sigma_{\text{ai}}^{2} \right) = N\left( 0,\sigma_{\text{ai}}^{2} \right)\]

\[\left\{ \begin{matrix} p\left( \sigma_{\text{ai}}^{2} \middle| S_{a},\nu_{a} \right) = S_{a}\chi_{\nu_{a}}^{- 2}\ with\ probability\ 1 - \pi \\ p\left( \sigma_{\text{ai}}^{2} \middle| S_{a},\nu_{a} \right) = 0\ with\ probability\ \pi \\ \end{matrix} \right.\]

Then, when \(\sigma_{\text{ai}}^{2} = 0\) it follows that \(a_{i} = 0\).

Intuitively, this prior corresponds to the following figure. The arrow means that there is a fraction \(\pi\) of markers with zero effect.

BayesB has a complex algorithm because it involves the computation of a complex likelihood. Details on its computation can be found in Rohan Fernando’s notes (http://www.ans.iastate.edu/stud/courses/short/2009/B-Day2-3.pdf , slides 20 and 34; http://taurus.ansci.iastate.edu/wiki/projects/winterworkshop2013 , Notes, p. 42) and also in (Villanueva et al. 2011).

10.11 BayesC(Pi)

Whereas the premises of BayesB seem interesting, the algorithm is not. Further, experience shows that it is sensitive to the prior values of \(S_{a}^{2},\nu_{a}\) and \(\pi\). As explained in (David Habier et al. 2011), this suggests a simpler prior scheme where markers having an effect are assigned a “common” variance, say \(\sigma_{a0}^{2}\). This is more simply explained by introducing additional variables \(\delta_{i}\) which indicate whether the \(i\)-th marker has an effect or not. In turn, these variables \(\delta\) have a Bernoulli prior distribution with a probability \(\pi\) of being 0. Therefore the hierarchy of priors is:

\[p\left( a_{i} \middle| \delta_{i} \right) = \left\{ \begin{matrix} N\left( 0,\sigma_{a0}^{2} \right)\text{\ if\ }\delta_{i} = 1 \\ 0\ \text{otherwise} \\ \end{matrix} \right.\]

\[p\left( \sigma_{a0}^{2} \middle| S_{a},\nu_{a} \right) = S_{a}\chi_{\nu_{a}}^{- 2}\]

\[p\left( \delta_{i} = 1 \right) = 1 - \pi\]

Where \(S_{a}\) can be set to something like \(S_{a}^{2}\)=\(\sigma_{a0}^{2}\nu_{a0}\) with

\[\sigma_{a0}^{2} = \frac{\sigma_{u}^{2}}{\left( 1 - \pi \right)2\sum p_{i}q_{i}}\]

Experience shows that this prior hierarchy is more robust than BayesB, the reason being that, at the end (after fitting the data), the values of \(\sigma_{a0}^{2}\) depend little on the prior. Thus the model may be correct even if the prior is wrong. Also, the algorithm is greatly simplified, and can be summarized as follows:

do j=1,niter
    do i=1,neq
      ...
       ! compute loglikelihood for state 1 (i -> in model)
       ! and 0 (not in model)
       ! Notes by RLF (2010, Bayesian Methods in
       ! Genome Association Studies, p 47/67)
       v1=xpx(i)*vare+(xpx(i)**2)*vara
       v0=xpx(i)*vare
       rj=rhs*vare ! because rhs=X'R-1(y corrected)
       ! prob state delta=0
       like2=density_normal((/rj/),v0) !rj = N(0,v0)
       ! prob state delta=1
       like1=density_normal((/rj/),v1) !rj = N(0,v1)
       ! add prior for delta
       like2=like2*pi; like1=like1*(1-pi)
       !standardize
       denom=like2+like1
       like2=like2/denom; like1=like1/denom
       delta(i)=sample(states=(/0,1/),prob=(/like2,like1/))
       if(delta(i)==1) then
         val=normal(rhs/lhs,1d0/lhs)
       else
         val=0
       endif
    ...
    enddo
   pi=1 - beta(count(delta==1)+apriori_included, &
               count(delta==0)+apriori_not_included)
    ss=sum(sol**2)+nua*Sa
    vara=ss/chi(nua+count(delta==1))

enddo

10.11.1 Markers associated to the trait

The value of \(1 - \pi\) (the proportion of markers having an effect) can be either fixed to a value or estimated from the data. This is achieved in the last lines of the code above. How is this possible? Intuitively, we count the markers that have (\(\delta = 1\)) or do not have (\(\delta = 0\)) an effect. Then we add prior information on \(\pi\). This comes in the form of a \(\text{Beta}(a,b)\) distribution, which is a distribution of fractions between 0 and 1, saying that our fraction is a priori “as if” we had drawn \(a\) black balls and \(b\) red balls from an urn, so that \(\pi = a/(a + b)\).

The genetic variance explained by markers in BayesC(Pi) is equal to

\[{\sigma_{u}^{2} = \sigma_{a0}^{2}\left( 1 - \pi \right)2\sum p_{i}q_{i}}\]

Thus, the same total genetic variance can be achieved with large values of \(\sigma_{a0}^{2}\) and small values of \(\left( 1 - \pi \right)\), or the opposite. This implies that the two are confounded, and it is not easy to find out how many markers should be in the model. For instance, (Colombani et al. 2013) reported meaningful estimates of \(\pi\) for Holstein but not for Montbeliarde.

Concerning markers, we have indicators of whether a given marker “is” or “is not” in the model, and these have been used as signals for QTL detection. However, the results are often not what one would expect. The output of BayesC(Pi) is \({\widehat{\delta}}_{i}\), the posterior mean of \(\delta_{i}\). This value will NOT be either 0 or 1 but something in between. So BayesCPi cannot be used to select “the set of SNPs controlling the trait”, because such a thing does not exist: there are many possible sets. The following graph shows the kind of result that we obtain:

QTL signals from BayesCPi with Pi=0.999

How can we declare significance? There is no such thing as a p-value here. We may, though, use the Bayes Factor (Wakefield 2009; Luis Varona 2010):

\[BF = \frac{p\left( \text{SNP in the model} \mid \text{data} \right)/p\left( \text{SNP not in the model} \mid \text{data} \right)}{p\left( \text{SNP in the model} \right)/p\left( \text{SNP not in the model} \right)}\]

In our case this is:

\[BF_{i} = \frac{\pi}{\left( 1 - \pi \right)}\frac{p\left( \delta_{i} = 1\mid\mathbf{y} \right)}{1 - p\left( \delta_{i} = 1\mid\mathbf{y} \right)}\]
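
As a sketch of the definition (posterior odds of inclusion divided by prior odds, with prior probability of inclusion \(1 - \pi\)), using invented numbers:

```python
def bayes_factor(post_incl, pi0):
    # posterior odds of inclusion divided by prior odds of inclusion,
    # with prior P(SNP in model) = 1 - pi0
    posterior_odds = post_incl / (1.0 - post_incl)
    prior_odds = (1.0 - pi0) / pi0
    return posterior_odds / prior_odds

# a SNP with posterior inclusion probability 0.5 under pi0 = 0.999:
# posterior odds = 1, prior odds = 1/999, so BF = 999
bf = bayes_factor(0.5, 0.999)
assert abs(bf - 999.0) < 1e-6
```

A posterior inclusion probability of only 0.5 can thus be strong evidence when the prior inclusion probability is very small.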

What thresholds should we use for BF? Some people suggest using permutations, but these are too computationally demanding. We can use a scale adapted from (Kass and Raftery 1995), sometimes used in QTL detection (L. Varona, García-Cortés, and Pérez-Enciso 2001; Vidal et al. 2005):

Something remarkable is that there is no need for a multiple-testing (Bonferroni) correction, because all SNPs were introduced at the same time and the prior already “penalizes” their estimates (Wakefield 2009). We compared several strategies for GWAS, including BayesCPi, and our conclusion is that all of them give similar results (Andres Legarra, Croiseau, et al. 2015).

10.12 Bayesian Lasso

The Bayesian Lasso (Park and Casella 2008 ; Campos et al. 2009 ; Andrés Legarra et al. 2011) suggests a different way to model the effect of markers. Instead of setting some of them a priori to 0, it sets them to very small values, as in the following Figure.

This corresponds in fact to the following a priori distribution of markers:

\[p\left( a_{i} \middle| \lambda \right) = \frac{\lambda}{2\sigma}\exp\left( - \frac{\lambda\left| a_{i} \right|}{\sigma} \right)\]

where the density function is on the absolute value of the marker effect and not on its square, as in the normal distribution. Coming back to our notion of variance of markers, (Park and Casella 2008 ; Campos et al. 2009) showed that the model is equivalent to a model with individual variances by marker, that is:

\[p\left( a_{i} \middle| \sigma_{\text{ai}}^{2} \right) = N\left( 0,\sigma_{\text{ai}}^{2} \right)\]

\[p\left( \sigma_{\text{ai}}^{2} \middle| \lambda \right) = \frac{\lambda^{2}}{2}\exp\left( - \frac{\lambda^{2}}{2}\ \frac{\sigma_{\text{ai}}^{2}}{\sigma^{2}} \right)\]

(NOTE: the \(\lambda\) here has nothing to do with the \(\lambda\) in BLUP-SNP). The latter density function is a prior distribution on the marker variances known as the exponential distribution. This is very similar to BayesA, in that a prior distribution is postulated for the marker variances. The difference is the nature of this prior distribution (exponential in the Bayesian Lasso, inverted chi-squared in BayesA), which can be seen in the following Figure. Whereas in the Bayesian Lasso very small variances are a priori likely, this is not the case in BayesA.

In practice, we have found that the Bayesian Lasso converges much better than BayesCPi, while being as accurate for predictions (Colombani et al. 2013).

10.12.1 Parameterization of the Bayesian Lasso

The term \(\sigma\) in the parameterization above has been the subject of some debate. The original implementation of (Park and Casella 2008) considered \(\sigma^{2} = \sigma_{e}^{2}\), the residual variance. (Andrés Legarra et al. 2011) objected that it was unnatural to model the distribution of markers on that of the residuals, and suggested setting \(\sigma^{2} = 1\). In this way, the interpretation of \(\lambda\) is quite straightforward as a reciprocal of the marker variance, because in that case \(Var\left( a_{i} \middle| \lambda \right) = 2/\lambda^{2}\). A natural way of fitting the prior value of \(\lambda\) is then as

\[\frac{2}{\lambda^{2}} = \frac{\sigma_{u}^{2}}{2\sum p_{i}q_{i}}\]

This is the default in software GS3. The algorithm with this parameterization is rather simple:

do j=1,niter
  do i=1,neq
    ! form lhs and rhs for marker i
    lhs=xpx(i)/vare+1d0/tau2(i)
    rhs=dot_product(X(:,i),e)/vare +xpx(i)*sol(i)/vare
    val=normal(rhs/lhs,1d0/lhs)
    e=e - X(:,i)*(val-sol(i))
    sol(i)=val
    ! draw marker variance:
    ! 1/tau2(i) ~ InvGaussian(sqrt(lambda2/sol(i)**2),lambda2)
    ss=sol(i)**2
    tau2(i)=1d0/rinvGauss(sqrt(lambda2/ss),lambda2)
  enddo
  ! draw residual variance
  ss=sum(e**2)+nue*Se
  vare=ss/chi(nue+ndata)
  ! update lambda
  ...
enddo
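
The two key updates (fitting \(\lambda\) from the genetic variance, and drawing the per-marker variance \(\tau_{i}^{2}\) from its inverse-Gaussian full conditional) can be sketched as follows, under the \(\sigma^{2} = 1\) parameterization. This is an illustration, not the GS3 code; the input numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_lambda2(var_u, sum2pq):
    # prior value of lambda from the genetic variance:
    # 2/lambda^2 = var_u / sum2pq, where sum2pq = 2*sum(p_i*q_i)
    return 2.0 * sum2pq / var_u

def sample_tau2(a_i, lambda2, rng):
    # full conditional of the Bayesian Lasso (Park & Casella 2008):
    # 1/tau2_i | a_i ~ InverseGaussian(sqrt(lambda2/a_i^2), lambda2);
    # numpy's `wald` distribution is the inverse Gaussian
    mu = np.sqrt(lambda2 / a_i**2)
    return 1.0 / rng.wald(mu, lambda2)

lambda2 = fit_lambda2(var_u=1.0, sum2pq=10.0)
assert lambda2 == 20.0
tau2 = sample_tau2(a_i=0.05, lambda2=lambda2, rng=rng)
assert tau2 > 0.0
```

Small estimated effects `a_i` give a large mean for \(1/\tau_{i}^{2}\), so their variances are shrunk towards zero, which is the Lasso behaviour described above.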

The alternative implementation takes \(\sigma^{2} = \sigma_{e}^{2}\), and can be found in the R package BLR (Pérez et al. 2010). In this case, a natural way of fitting the prior value of \(\lambda\) is as (Pérez et al. 2010)

\[\frac{2}{\lambda^{2}} = \frac{\sigma_{u}^{2}}{\sigma_{e}^{2}2\sum p_{i}q_{i}}\]

In this case, \(\lambda\) can be thought of as a ratio between marker variance and residual variance (signal-to-noise). The two parameterizations are not strictly equivalent, depending on the priors used for \(\lambda\) and the different variances, but they should give very similar results (in spite of the discussion in Andrés Legarra et al. 2011).

10.13 Stochastic Search Variable Selection

Yet another method; it postulates two kinds of markers: those with a large effect, and those with a small (but not zero) effect. Similarly to BayesC(Pi), these are reflected in two variances, one for the large effects (\(\sigma_{\text{al}}^{2}\)) and one for the small effects (\(\sigma_{\text{as}}^{2}\)). The idea comes from (George and McCulloch 1993), and details can be found in e.g. (Verbyla et al. 2009). The advantage of this method is that it is rather fast and does not require likelihood computations, although choosing a priori the proportions of “large” and “small” effects might be tricky.

10.14 Overall recommendations for Bayesian methods

BayesB seems not to be very robust. The other methods are reasonably robust. My (AL) personal suggestion is to start from BLUP-SNP, which is very robust, and then progress to other methods. Meaningful prior information (for instance, setting \(\lambda\) from the genetic variance) is relevant, if for nothing else, to have correct starting values. Bayesian methods often give accuracies similar to BLUP-SNP, but important exceptions such as fat and protein content in dairy cattle do exist.

10.15 Empirical single marker variances from marker estimates

In SNP-BLUP (or equivalently, from GBLUP) it is easy to get marker estimates, but running a full Bayesian analysis can be long or impossible. So, people came up with ideas to get these weights (P. M. VanRaden 2008 ; Wang et al. 2012 ; Fragomeni et al. 2017).

10.16 VanRaden’s NonLinear methods

Gibbs samplers are notoriously slow, and this hampers the implementation of Bayesian methods for genomic predictions. VanRaden (P. M. VanRaden 2008) presented NonLinearA and NonLinearB, iterative methods that do not need samplers and converge in a few iterations. NonLinearA assumes a certain departure from normality, called “curvature” (say \(c\)), that ranges between 1 (regular BLUP-SNP) and 1.25 (Cole et al. 2009), such that the distribution resembles more closely a fat-tailed distribution like the Bayesian Lasso or BayesA. In our notation, this means that the marker variance is updated as

\[\sigma_{\text{ai}}^{2} = \sigma_{a0}^{2}\left( c^{\left( \frac{\left| {\widehat{a}}_{i} \right|}{\text{sd}\left( {\widehat{a}}_{1},\ldots\ ,{\widehat{a}}_{n} \right)} - 2 \right)} \right)\]

The role of the curvature is similar to (but goes in the opposite direction of) the degrees of freedom in BayesA: the more curvature, the more large marker effects are allowed. For instance, if \(c = 1.25\) and a marker estimate is an outlier in the distribution of marker estimates, with, say, a standardized value of 2.5, its variance \(\sigma_{\text{ai}}^{2}\) will be increased by a factor \({1.25}^{0.5} = 1.12\). To avoid numerical problems in small data sets, it is recommended to use \(c = 1.12\) and to impose a limit of 5 on \(\frac{\left| {\widehat{a}}_{i} \right|}{\text{sd}\left( {\widehat{a}}_{1},\ldots\ ,{\widehat{a}}_{n} \right)}\) (VanRaden, personal communication). This algorithm is fast, stable and regularly used for dairy cattle genomic evaluation.
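
The variance update can be written as a short sketch (the marker estimates below are invented; the cap of 5 on the standardized value follows the recommendation above):

```python
import numpy as np

def nonlinearA_variances(ahat, var_a0, c=1.25, cap=5.0):
    """NonLinearA update of per-marker variances:
    sigma2_ai = var_a0 * c**(|ahat_i|/sd(ahat) - 2),
    with the standardized value capped at `cap`."""
    s = np.abs(ahat) / np.std(ahat)
    s = np.minimum(s, cap)
    return var_a0 * c ** (s - 2.0)

# three small effects and one outlying marker
ahat = np.array([0.01, -0.02, 0.015, 0.30])
v = nonlinearA_variances(ahat, var_a0=1.0)
# the outlier gets a variance larger than var_a0, the others smaller
assert v[3] > 1.0 and v[:3].max() < 1.0
```

Only markers whose standardized estimate exceeds 2 get their variance inflated, which is how the method mimics a fat-tailed prior.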

The whole setting is very similar to BayesA or to the Bayesian Lasso, with \(c\) playing the role of \(\lambda\). The prior density for marker effects departs from normality for markers beyond two standard deviations, as shown in the next Figure. It can be seen that large marker effects are much more likely in NonLinearA than under a normal density.

NonLinearB is akin to BayesC(Pi) (some markers are 0 and the others share a common variance), whereas NonLinearAB is similar to BayesA (some markers are zero and the others have a variance that may change from marker to marker). NonLinearB uses a mixture distribution, in which \(\sigma_{ai}^{2}\) is obtained as an average of variances weighted by the likelihood that the marker has a zero effect or not. However, the algorithm will not be further detailed here.

10.17 The effect of allele coding on Bayesian Regressions

We have explained how allele coding should (or can) proceed. (I. Strandén et al. 2017) analyzed the effect of allele coding on genomic predictions. One needs to distinguish carefully between two things here: what we mean by allele coding is the coding of the genotype matrix \(\mathbf{Z}\), not the frequencies used in \(\sigma_{a0}^{2} = \frac{\sigma_{u}^{2}}{2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}}.\)

One of their results is that, for any model including a “fixed” effect such as an overall mean \(\mu\) or a cross-classified effect (e.g., sex), estimates of marker effects \(\widehat{\mathbf{a}}\) and estimated genetic values \(\widehat{\mathbf{u}}\mathbf{=}\mathbf{Z}\widehat{\mathbf{a}}\) are invariant to the parameterization of \(\mathbf{Z}\) (centered, 101, 012 or 210), up to a constant. This constant goes into the overall mean or fixed effect. Consider for instance the mean. The mean of the genetic values of the population will be \(\mathbf{1}\mathbf{'}\widehat{\mathbf{u}}\); this mean is not invariant to parameterization, and it cannot be separated from the overall mean of the model, \(\mu\). If the centered coding is used, then \(\mathbf{1}^{\mathbf{'}}\widehat{\mathbf{u}}\mathbf{=}\mathbf{1}^{\mathbf{'}}\mathbf{Z}\widehat{\mathbf{a}} = 0\). As for the marker variance \(\sigma_{a0}^{2}\) estimated by, say, BayesC, they also proved that it is invariant to the parameterization of \(\mathbf{Z}\).

In other words, we can use any coding (centered, 101 or 012 or 210) in \(\mathbf{Z}\) for Bayesian methods. The estimated \(\widehat{\mathbf{u}}\) will be the same, the estimated \(\sigma_{a0}^{2}\) or \(\pi\) will be the same, and the estimated genetic variance computed using, for instance, \(\sigma_{u}^{2} = \sigma_{a0}^{2}2\sum_{i}^{\text{nsnp}}{p_{i}q_{i}}\) will be the same too.

These results are convenient because they assure us that any allele coding works. However, this result does not apply to all features. For instance, the standard deviation (and therefore, in animal breeding words, the “model-based” reliability) of estimated genetic values \(\widehat{\mathbf{u}}\) is not invariant to parameterization, because a part of the overall mean will, or will not, be absorbed by \(\mathbf{Z}\widehat{\mathbf{a}}\). This implies that reports of the posterior variance of \(\widehat{\mathbf{u}}\) depend on the allele coding. The same result applies to GBLUP, as we will see later.

10.18 Reliabilities from marker models

10.18.1 Standard errors from Bayesian methods by MCMC

In these methods, at iteration \(t\), a sample \({\widetilde{\mathbf{a}}}_{(t)}\) of the posterior distribution of marker effects is obtained. At iteration \(t\), samples of the breeding values can be obtained as \({\widetilde{\mathbf{u}}}_{(t)} = \mathbf{Z}{\widetilde{\mathbf{a}}}_{(t)}\). At the end of the MCMC process, the final estimate of the breeding value for, say, individual \(i\) consists of the posterior mean of all \(\widetilde{u}\) for that animal,

\[{\widehat{u}}_{i} = {\overline{\widetilde{u}}}_{i}\]

and a posterior variance \({Var(\widehat{u}}_{i}) = Var\left( {\widetilde{u}}_{i} \right)\). This variance (or rather its square root, the standard error) can be used in itself as a descriptor of the uncertainty of the breeding value. A 95% confidence interval for the breeding value is roughly \({\widehat{u}}_{i} \pm 2\text{sd}\left( {\widehat{u}}_{i} \right)\).
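
A minimal sketch of these posterior summaries, using simulated draws in place of real MCMC output (the sample matrix, genotypes and dimensions are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical MCMC output: 1000 kept samples of 5 marker effects
a_samples = rng.normal(0.0, 0.1, size=(1000, 5))
# 012 genotypes of 3 individuals
Z = np.array([[0., 1., 2., 1., 0.],
              [2., 2., 1., 0., 1.],
              [1., 0., 0., 2., 2.]])

u_samples = a_samples @ Z.T        # u_(t) = Z a_(t), one row per MCMC draw
u_hat = u_samples.mean(axis=0)     # posterior mean: the estimated breeding value
u_se = u_samples.std(axis=0)       # posterior standard error
lo, hi = u_hat - 2 * u_se, u_hat + 2 * u_se   # rough 95% interval
assert np.all(lo < u_hat) and np.all(u_hat < hi)
```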

10.18.2 Reliabilities

Reliabilities are only well defined for a multivariate normal model – SNP-BLUP with fixed \(\sigma_{a0}^{2}\). The first method uses \(Var\left( {\widehat{u}}_{i} \right)\) as above (i.e. from MCMC). Reliability can be obtained as

\[\text{Re}l_{i} = 1 - \frac{Var\left( {\widehat{u}}_{i} \right)}{\mathbf{z}_{i}\mathbf{z}_{i}^{'}\sigma_{a0}^{2}}\]

The second method uses the complete a posteriori distribution of marker effects:

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \mathbf{C}^{\text{aa}}\]

That can be obtained by MCMC or by inversion of the SNP-BLUP equations. From here we can derive that:

\[Var\left( \mathbf{u} \middle| \mathbf{y} \right) = \mathbf{Z}\mathbf{C}^{\text{aa}}\mathbf{Z}^{'}\]

And therefore \(Var\left( {\widehat{u}}_{i} \right) = \mathbf{z}_{i}\mathbf{C}^{\text{aa}}\mathbf{z}_{i}^{'}\). The rest proceeds as before.

Imagine for instance that we have 50K markers and 1 million animals in prediction. Imagine that we use the SNP-BLUP equations and obtain by inversion \(\mathbf{C}^{\text{aa}}\), a 50K by 50K matrix. Then, for each animal, we compute \(\mathbf{z}_{i}\mathbf{C}^{\text{aa}}\mathbf{z}_{i}^{'}\) (which has a high cost) and \(\mathbf{z}_{i}\mathbf{z}_{i}^{'}\sigma_{a0}^{2}\) (which has a negligible cost).
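
The per-animal computation can be sketched with a tiny hypothetical example (the 3-marker posterior covariance `Caa`, the genotypes and \(\sigma_{a0}^{2}\) are all invented for illustration):

```python
import numpy as np

def snp_blup_reliability(z_i, Caa, var_a0):
    """Rel_i = 1 - Var(u_hat_i) / (z_i z_i' var_a0),
    with Var(u_hat_i) = z_i Caa z_i' from the SNP-BLUP posterior covariance."""
    var_uhat = z_i @ Caa @ z_i          # z_i C^aa z_i' (the costly part)
    return 1.0 - var_uhat / (z_i @ z_i * var_a0)

var_a0 = 0.01
Caa = np.diag([0.002, 0.004, 0.001])    # hypothetical posterior covariance
z = np.array([1.0, -0.5, 0.5])          # centered genotypes of one animal
rel = snp_blup_reliability(z, Caa, var_a0)
assert 0.0 < rel < 1.0
```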

These reliabilities have a problem. We know that both \({\widehat{u}}_{i}\) and \(Var\left( {\widehat{u}}_{i} \right)\) are invariant to parameterization (coding of \(\mathbf{Z}\)). But \(\mathbf{z}_{i}\mathbf{z}_{i}^{'}\) depends on the parameterization, and therefore we can obtain exactly the same breeding values but different reliabilities as a function of the chosen coding.

11 Genomic relationships

11.1 Reminder about relationships

Wright (1922) introduced the notion of relationships as correlations between the genetic effects of two individuals. For practical reasons, it is more convenient to use what is often called the “numerator relationship” (Quaas 1976), or simply “relationship” or “additive relationship”. This equals the standardized covariance (not the correlation) between the additive genetic values of two individuals; the pedigree relationship is not equal to the correlation if there is inbreeding. There are several terms used to talk about relationships, and here we present the classical definitions according to pedigree:

All these measures of relatedness are defined with respect to a base population constituted by founders, which are assumed to be unrelated and carriers of different alleles at causal QTLs. As a byproduct, relationships estimated using pedigrees are strictly positive. However, this is not the case when we consider marker or QTL information.

11.2 Identity by state and identity by descent of two individuals

The probability of identity by state (IBS), or “molecular” coancestry (which we will denote \(f_{\text{Mij}}\)), refers to the number of alleles shared by two individuals: it is the probability that two alleles picked at random, one from each individual, are identical. For the purposes of these notes we will refer to molecular relationships, \(r_{\text{Mij}} = 2f_{\text{Mij}}\) (to be on the same scale as \(A_{\text{ij}}\)). These \(r_{\text{Mij}}\) are sometimes called “similarity index” but also “total allelic relationship” (Nejati-Javaremi, Smith, and Gibson 1997). For the two-allele case, this is summarised in the following table:

Molecular relationships for combinations of different genotypes
     AA   Aa   aa
AA    2    1    0
Aa    1    1    1
aa    0    1    2

In fact, the molecular relationship can be obtained in a mathematical form without counting because (M. Á. Toro, García-Cortés, and Legarra 2011)

\[r_{\text{Mij}} = z_{i}z_{j} - z_{i} - z_{j} + 2\]

where \(z_{i}\) is coded as \(\{ 0,1,2\}\). This expression, connected with genomic relationships, will show its utility later on.
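
The formula can be checked against the table directly; this small sketch reproduces it for the genotypes coded AA = 2, Aa = 1, aa = 0:

```python
def r_molecular(zi, zj):
    # total allelic relationship from 012 genotype codes (Toro et al. 2011):
    # r_Mij = zi*zj - zi - zj + 2
    return zi * zj - zi - zj + 2

# rows and columns ordered AA, Aa, aa, as in the table above
table = [[r_molecular(zi, zj) for zj in (2, 1, 0)] for zi in (2, 1, 0)]
assert table == [[2, 1, 0], [1, 1, 1], [0, 1, 2]]
```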

The identity by state reflected in the molecular relationship \(r_{\text{Mij}}\) and the identity by descent (IBD) reflected in the pedigree relationship \(A_{\text{ij}}\) have a well-known relationship that is periodically revisited (Li and Horvitz 1953 ; Eding and Meuwissen 2001 ; Powell, Visscher, and Goddard 2010 ; M. Á. Toro, García-Cortés, and Legarra 2011). A formal derivation can be found in (C. C. Cockerham 1969) (see also M. Á. Toro, García-Cortés, and Legarra 2011). A simple one is as follows. Consider one allele sampled from individual \(i\) and another sampled from individual \(j\). They can be identical because they are identical by descent (with probability \(A_{\text{ij}}/2\)), or because, not being identical by descent (with probability \(1 - A_{\text{ij}}/2\)), they are identical just by chance (with probability \(p^{2} + q^{2}\)). Therefore, \(f_{\text{Mij}} = \theta_{\text{ij}} + \left( 1 - \theta_{\text{ij}} \right)\left( p^{2} + q^{2} \right)\), where \(\theta_{\text{ij}} = A_{\text{ij}}/2\) is the pedigree coancestry, and

\[r_{\text{Mij}} = A_{\text{ij}} + \left( 2 - A_{\text{ij}} \right)\left( p^{2} + q^{2} \right)\]

also,

\[A_{\text{ij}} = \frac{r_{\text{Mij}} - 2p^{2} - 2q^{2}}{2pq}\]

Thus, IBS is biased upwards with respect to IBD. Rearranging, we have that:

\[\left( 1 - f_{\text{Mij}} \right) = (1 - \theta_{\text{ij}})(1 - p^{2} - q^{2})\]

which is in the form of Wright’s fixation indexes. This means that molecular heterozygosity or, in other words, “non-alikeness” of two individuals, equals “non-alikeness” by descent times “non-alikeness” of markers.

There is another important point. The expression above to get IBD relationships from IBS relationships is identical, up to a constant, to VanRaden’s \(\mathbf{G}\) that will be detailed later. Therefore, results will be identical using IBD or IBS relationships (we will come back to this later).

11.2.1 Covariance between individuals

What does it mean “covariance between individuals”? The covariance is always computed across several pairs of things:

a b
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 3

Here, \(\text{Cov}\left( a,b \right) = 1.17\)

So, how can we define the covariance between the genetic values of two individuals \(i\) and \(j\) (for instance, bulls ALTACEASAR and BODRUM)? These guys are just two – you can’t compute a covariance with just one pair. ALTACEASAR and BODRUM have definite true genetic values that we do not know, so you cannot calculate a covariance between their true breeding values, because there is only one realization of the pair. However, the mental construction is as follows. If I repeated events in my cattle pedigree (transmission of QTL from parents to offspring) many times, or simulated them, individuals ALTACEASAR and BODRUM would inherit different QTLs and therefore show different genetic values in different repetitions. The covariance of these two hypothetical vectors of genetic values is what we call the covariance between individuals.

11.3 Relationships across individuals for a single QTL

Assume that you are studying a species with a single biallelic quantitative gene. You genotype the individuals and you are asked: what is the covariance between individuals \(i\) and \(j\), whose genotypes are known? Let us express the breeding values as the genetic value (\(za\)) deviated from the population mean, \(\mu = 2pa\):

\[u_{i} = z_{i}a - 2pa = \left( z_{i} - 2p \right)a\]

\[u_{j} = z_{j}a - 2pa = \left( z_{j} - 2p \right)a\]

where \(z_{i}\) is expressed as \(\{ 0,1,2\}\) copies of the reference allele of the QTL (let us say allele A), with effect \(a\). The effect of the QTL has some prior distribution with variance \(Var\left( a \right) = \sigma_{a}^{2}\), and the genetic variance under Hardy-Weinberg equilibrium is \(2\text{pq}\sigma_{a}^{2}\). It follows from regular rules of variances and covariances that

\[\text{Cov}\left( u_{i},u_{j} \right) = \left( z_{i} - 2p \right)\left( z_{j} - 2p \right)\sigma_{a}^{2}\]

If we define \(z_{i}^{*} = z_{i} - 2p\), in other words, if we use the “centered” coding instead of “012”, then the covariance between two individuals is \(z_{i}^{*}z_{j}^{*}\sigma_{a}^{2}\).

Dividing the covariance \(z_{i}^{*}z_{j}^{*}\sigma_{a}^{2}\) by the genetic variance \(2\text{pq}\sigma_{a}^{2}\), we obtain the additive relationships produced by the QTL, which I will call \(r_{\text{Qij}}\). Two examples, for \(p = 0.5\) and \(p = 0.25\), are shown in the next tables:

Relationships \(r_{\text{Qij}}\) between individuals for a single QTL with \(p = 0.5\)
     AA   Aa   aa
AA    2    0   -2
Aa    0    0    0
aa   -2    0    2

Relationships \(r_{\text{Qij}}\) between individuals for a single QTL with \(p = 0.25\)
     AA    Aa    aa
AA    6     2    -2
Aa    2    2/3  -2/3
aa   -2   -2/3   2/3

11.3.1 Negative relationships

Now, this is puzzling because we have negative relationships. The reason is that we have forced the breeding values to refer to the average of the population. However, there is no error: we need to interpret the values as standardized covariances (P. M. VanRaden 2008 ; Powell, Visscher, and Goddard 2010). This was also frequently done by Wright, who would accept “negative” inbreeding. The intuitive explanation is that, if the average breeding value is to be zero, some animals will be above zero and some below zero, and animals carrying different genotypes will show negative covariances.

Neither can these relationships be interpreted as probabilities. Setting negative (genomic) relationships to 0 is a serious conceptual error and gives many problems, yet it is often done.

11.3.2 Centered relationships and IBS relationships

It can be noted that the table above with \(p = 0.5\) is equal to Table 16 of molecular (or IBS) relationships before, minus a value of 1, times 2:

\[ 2 \left(\begin{pmatrix} 2 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix} - \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \right) = \begin{pmatrix} 2 & 0 & -2 \\ 0 & 0 & 0 \\ -2 & 0 & 2 \end{pmatrix} \]

This shows that relationships at the QTL can be expressed as IBS at the QTL (Nejati-Javaremi, Smith, and Gibson 1997), and these can be interpreted as twice a probability, like regular relationships in \(\mathbf{A}\). The constant value of 1 added to all IBS relationships is absorbed by the mean (Ismo Strandén and Christensen 2011), and models using either parameterization (and also any assumed \(p\)) give identical estimates of breeding values in the GBLUP context that we will see later on.

Therefore, using IBS relationships or genomic relationships gives identical estimates of breeding values –if associated variance components are comparable.

11.3.3 Inbreeding at a single QTL

Inbreeding would be the value of the self-relationship \(r_{\text{Qii}}\) minus 1. This is puzzling because we obtain negative values for heterozygotes. What this means is that there is less homozygosity than expected (Falconer and Mackay 1996).

11.4 Genomic relationships: Relationships across individuals for many markers

These methods use SNPs to infer relationships among individuals, quantifying the number of alleles shared between two individuals. Genomic relationships start from identity by state (IBS), because they consider whether two alleles randomly picked, one from each individual, are identical, independently of their origin. However, they are later modified to conform to identity by descent. Pedigree relationships are identity by descent (IBD) because they consider alleles shared through a common ancestor. However, they are also incorrect, for two reasons. First, they assume that the genome is infinite, which it is not, and therefore pedigree relationships are correct only on average. Second, the pedigree is not infinitely deep – it is known for a limited number of generations (the most that I have handled is 40).

11.4.1 VanRaden’s first genomic relationship matrix

We proceed to derive relationships for many markers as we did for one QTL. The derivation is fairly easy and purely statistical. To refer breeding values to an average value of 0, we adopt the “centered” coding for genotypes described before and shown below:

Additive coding for marker effects at locus \(i\) with reference allele \(A\).
Genotype   101 coding     012 coding     Centered coding
aa         \(-a_{i}\)     \(0\)          \(-2p_{i}a_{i}\)
Aa         \(0\)          \(a_{i}\)      \((1 - 2p_{i})a_{i}\)
AA         \(a_{i}\)      \(2a_{i}\)     \(\left( 2 - 2p_{i} \right)a_{i}\)

In theory, to refer the breeding values to the pedigree base population, we should use the allele frequencies of the base population, but these are rarely available (although Gengler’s method can be used); often, the current observed frequencies are used. At any rate, we have that

\[\mathbf{u = Za}\]

That is, individuals are a sum over genotypes of markers’ effects. We have shown that marker effects can be considered to have an a priori distribution, and this a priori distribution has a variance

\[Var\left( \mathbf{a} \right)\mathbf{= D}\]

With

\(\mathbf{D}\mathbf{=}\begin{pmatrix} \sigma_{a1}^{2} & 0 & \ldots & 0 \\ 0 & \sigma_{a2}^{2} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \sigma_{\text{an}}^{2} \\ \end{pmatrix}\)

We could fit different variances by marker, but usually it is assumed that \(\mathbf{D} = \mathbf{I}\sigma_{a0}^{2}\). Then, the variance-covariance matrix of breeding values is

\[Var\left( \mathbf{u} \right) = \mathbf{Z}Var\left( \mathbf{a} \right)\mathbf{Z}^{\mathbf{'}}\mathbf{= ZD}\mathbf{Z}^{\mathbf{'}}\mathbf{= Z}\mathbf{Z}^{\mathbf{'}}\sigma_{a0}^{2}\]

Do not confuse \(Var\left( \mathbf{u} \right)\) (which is a matrix) with \(Var(u)\) (which is a scalar: \(Var\left( u \right) = E\left( u^{2} \right) - E\left( u \right)^{2} = \sigma_{u}^{2}\)). Elements in \(\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\sigma_{a0}^{2}\) are, however, NOT relationships. Relationships are standardized covariances, and the variance we need to divide by is the genetic variance or, in other words, the variance of the breeding values of a set of animals. If we assume our population to be in Hardy-Weinberg and linkage equilibrium, then we have shown that

\[\sigma_{u}^{2} = 2\sum_{i = 1}^{\text{nsnp}}{p_{i}q_{i}\sigma_{a0}^{2}}\]

Therefore, we can now divide \(Var\left( \mathbf{u} \right)\) above by this variance and this gives the genomic relationship matrix (P. M. VanRaden 2008):

\[\mathbf{G =}\frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum p_{i}q_{i}}\]

When we divide \(\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\) by \(2\sum_{}^{}{p_{i}q_{i}}\), \(\mathbf{G}\) becomes analogous to the numerator relationship matrix (\(\mathbf{A}\)). Quoting VanRaden: “The genomic inbreeding coefficient for individual j is simply \(G_{\text{jj}}\ - \ 1\), and genomic relationships between individuals j and k, which are analogous to the relationship coefficients of Wright (1922), are obtained by dividing elements \(G_{\text{jk}}\) by square roots of diagonals \(G_{\text{jj}}\) and \(G_{\text{kk}}\).” The \(\mathbf{G}\) matrix measures the homozygosity of each individual in the diagonal, and the number of alleles shared among individuals in the off-diagonals. These measures are not IBS: they are IBS modified by the centering due to allele frequencies, and therefore they become a good approximation of real IBD, better than the approximation we obtain from the pedigree. This is one of the reasons why genomic predictions are better than pedigree predictions.
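
The construction of this first \(\mathbf{G}\) can be sketched as follows, with an invented 3-animal, 4-marker genotype matrix and observed allele frequencies:

```python
import numpy as np

def vanraden_G(M, p=None):
    """VanRaden's first G: Z Z' / (2 sum p_i q_i),
    with M the animals x markers matrix of 012 genotypes and
    p the allele frequencies (observed frequencies by default)."""
    M = np.asarray(M, dtype=float)
    if p is None:
        p = M.mean(axis=0) / 2.0      # observed allele frequencies
    Z = M - 2.0 * p                   # centered coding
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

M = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 0, 1, 0]])
G = vanraden_G(M)
assert np.allclose(G, G.T)           # symmetric, like A
assert abs(G.mean()) < 1e-10         # with observed p, G averages to 0
```

The last assertion anticipates a property discussed later: with centered \(\mathbf{Z}\) built from observed frequencies, the average of \(\mathbf{G}\) is 0.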

11.4.2 VanRaden’s second genomic relationship matrix

A second matrix, suggested by (P. M. VanRaden 2008) but made popular by (and often incorrectly attributed to) (Yang et al. 2010), weights each marker differentially, using a matrix of weights \(\mathbf{D}_{w}\): \(Var\left( \mathbf{u} \right) = \mathbf{Z}\mathbf{D}_{w}\mathbf{\ }\mathbf{Z}^{\mathbf{'}}\sigma_{u}^{2}\), where genomic relationships are

\[\mathbf{G = Z}{\mathbf{D}_{w}\mathbf{Z}}^{\mathbf{'}}\]

with

\[\mathbf{D}_{\mathbf{w}}\mathbf{=}\begin{pmatrix} \frac{1}{n\ 2p_{1}q_{1}} & 0 & \ldots & 0 \\ 0 & \frac{1}{n\ 2p_{2}q_{2}} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \frac{1}{n\ 2p_{n}q_{n}} \\ \end{pmatrix}\]

where \(n\) is the number of markers. This matrix can be interpreted as an average of genomic relationship matrices, one per marker:

\[\mathbf{G =}\frac{1}{\text{nsnp}}\sum_{i = 1}^{\text{nsnp}}\mathbf{G}_{i} = \frac{1}{\text{nsnp}}\sum_{i = 1}^{\text{nsnp}}\frac{\mathbf{z}_{i}\mathbf{z}_{i}^{'}}{2p_{i}q_{i}}\]

where \(\mathbf{z}_{i}\) is a vector with genotypes for marker \(i\). This corresponds as well to \(Var\left( \mathbf{u} \right) = \mathbf{ZDZ'}\) where

\[\mathbf{D =}\begin{pmatrix} \frac{\sigma_{u}^{2}}{n\ 2p_{1}q_{1}} & 0 & \ldots & 0 \\ 0 & \frac{\sigma_{u}^{2}}{n\ 2p_{2}q_{2}} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \frac{\sigma_{u}^{2}}{n\ 2p_{n}q_{n}} \\ \end{pmatrix}\]

This “second” genomic relationship matrix, which is quite used, has several problems. The first is that it is very sensitive to small allele frequencies, which give a high weight to very rare alleles. For monomorphic markers (\(p = 0\) or 1) the matrix is undefined, which is not the case for the “first” \(\mathbf{G}\).

The second problem is that it assumes that the contribution of each marker to the overall \(\mathbf{G}\) is identical in terms of variance, which means that markers with small allele frequencies have large effects. The genetic variance contributed by marker \(i\) is \(\sigma_{u}^{2}/n\), irrespective of its allele frequency, and \(\sigma_{\text{ai}}^{2} = \sigma_{u}^{2}/n2p_{i}q_{i}\). Consider two loci with different allele frequencies \(\left\{ 0.1,0.5 \right\}\) and \(\sigma_{u}^{2} = 1\): ignoring the factor \(1/n\), common to both, the first locus has \(\sigma_{a1}^{2} \approx 5.6\) and the second \(\sigma_{a2}^{2} = 2\). Therefore, using this matrix imposes different a priori variances of markers depending on their frequencies. This has no biological justification, in my opinion (AL).
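
The two-locus comparison can be verified with a short sketch (taking \(n = 1\) so that the common factor \(1/n\) drops out, as in the text):

```python
def per_marker_variance(var_u, p, n):
    # implied per-marker effect variance in the "second" G:
    # sigma2_ai = var_u / (n * 2 * p * q)
    return var_u / (n * 2.0 * p * (1.0 - p))

v_rare = per_marker_variance(1.0, 0.1, n=1)    # 2pq = 0.18
v_common = per_marker_variance(1.0, 0.5, n=1)  # 2pq = 0.5
# the rare allele is forced to have the larger a priori effect variance
assert v_rare > v_common
assert round(v_rare, 2) == 5.56 and v_common == 2.0
```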

11.4.3 Allelic frequencies to put in genomic relationships

There is some confusion about the allele frequencies to use in the construction of \(\mathbf{G}\). (Ismo Strandén and Christensen 2011) proved that, if the form is \(\mathbf{G}\mathbf{=}\mathbf{Z}\mathbf{Z}^{\mathbf{'}}/2\sum p_{i}q_{i}\), the allele frequencies used to construct \(\mathbf{Z}\) are irrelevant: the only change from using different allele frequencies is a shift by a constant, which is absorbed by the mean. To obtain unbiased values on the same scale as regular relationships, one should use base population allele frequencies.

However, the allele frequency in the denominator is more important. The expression \(\sigma_{u}^{2} = 2\sum_{i = 1}^{\text{nsnp}}{p_{i}q_{i}\sigma_{a0}^{2}}\) puts the genetic variance in one population as a function of the allele frequencies in that same population. Thus, dividing by the current allele frequencies implies that we refer to the current genetic variance. If there are many generations between the current genotypes and the pedigree base, the genetic variance will be reduced. Ways to deal with this will be suggested later.

11.4.4 Properties of G

We will refer here to properties derived for \(\mathbf{G}\mathbf{=}\mathbf{Z}\mathbf{Z}^{\mathbf{'}}/2\sum p_{i}q_{i}\) if “observed” genomic relationships are used.

11.4.4.1 The average value of \(\mathbf{u}\) is 0

The first property is that the average value of \(\mathbf{u}\) is 0, because \(\mathbf{Z}\) is centered.

11.4.4.2 The average value of \(\mathbf{G}\) is 0

The second property is that, the average value of \(\mathbf{G}\) is 0. The reason for this is that, by centering the matrix \(\mathbf{Z}\), the product \(\mathbf{1'Z}\) is equal to a row vector of 0’s, as each column of \(\mathbf{Z}\) sums to 0 by its centering. And \(mean(\mathbf{G})=\frac{(\mathbf{1'Z})(\mathbf{Z'1})}{m^2 2\sum p_{i}q_{i}}\) with \(m\) the number of animals.

A related property is that, under linkage equilibrium (LE), the off-diagonal terms of \(\mathbf{Z}^{\mathbf{'}}\mathbf{Z}\) are zero in expectation, for the following reason. These terms are the crossproducts of covariables associated with loci \(i\) and \(j\). In LE, the allele combinations occur with frequency \(\left( 1 - p_{i} \right)\left( 1 - p_{j} \right)\) for the co-occurrence of alleles “a” in \(i\) and “a” in \(j\), \(\left( p_{i} \right)\left( 1 - p_{j} \right)\) for “A” and “a”, and so on. Then, summing the products of covariables for the pairs (“a”,“a”), (“A”,“a”), (“a”,“A”) and (“A”,“A”), weighted by their respective frequencies:

\[ \begin{aligned} E \left( \mathbf{z_i ' z_j} \right) = & \left( 1 - p_{i} \right)\left( 1 - p_{j} \right)\left( - p_{i} \right)\left( - p_{j} \right) + \\ & \left( p_{i} \right)\left( 1 - p_{j} \right)\left( 1 - p_{i} \right)\left( - p_{j} \right) + \\ & \left( 1 - p_{i} \right)\left( p_{j} \right)\left( - p_{i} \right)\left( 1 - p_{j} \right) + \\ & \left( p_{i} \right)\left( p_{j} \right)\left( 1 - p_{i} \right)\left( 1 - p_{j} \right) = 0 \end{aligned} \]

A verbal explanation is that, if the average value of \(\mathbf{u}\) is 0, then some animals will be more related than the average and others less related than the average – hence the 0 average relationship.
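A quick numeric check (a sketch, not from the notes) that the frequency-weighted sum above vanishes for any pair of allele frequencies:

```python
# Evaluate the four frequency-weighted crossproduct terms for random p_i, p_j
# and verify that they always sum to zero (the LE result above).
import random

random.seed(1)
for _ in range(1000):
    pi, pj = random.random(), random.random()
    total = ((1 - pi) * (1 - pj) * (-pi) * (-pj)
             + pi * (1 - pj) * (1 - pi) * (-pj)
             + (1 - pi) * pj * (-pi) * (1 - pj)
             + pi * pj * (1 - pi) * (1 - pj))
    assert abs(total) < 1e-12
print("E(z_i' z_j) = 0 under LE, for any p_i, p_j")
```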

11.4.4.3 The average value of the diagonal of \(\mathbf{G}\) is 1 if there is no inbreeding

This requires Hardy-Weinberg (but not linkage equilibrium). This can be seen by noting that \(\text{tr}\left( \mathbf{Z}\mathbf{Z}^{\mathbf{'}} \right) = tr\left( \mathbf{Z}^{\mathbf{'}}\mathbf{Z} \right)\) where \(\text{tr}\) is the trace operator. The expression \(\text{tr}\left( \mathbf{Z}^{\mathbf{'}}\mathbf{Z} \right)\) is the sum of squared covariables corresponding to effects of alleles “a” and “A”, which occur in \(m\) animals with respective frequencies \(1 - p_{i}\) and \(p_{i}\) in locus \(i\). This is:

\[\mathbf{z}_{i}^{'}\mathbf{z}_{i} = 2m\left\lbrack \left( 1 - p_{i} \right)p_{i}^{2} + p_{i}\left( 1 - p_{i} \right)^{2} \right\rbrack = 2mp_{i}\left( 1 - p_{i} \right) = 2mp_{i}q_{i}\]

Therefore, the diagonal of \(\mathbf{G}\) has an average of

\[\frac{1}{m}\text{tr}\left( \frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum p_{i}q_{i}} \right) = \frac{2m\sum p_{i}q_{i}}{2m\sum p_{i}q_{i}} = 1\]

If there is inbreeding, Hardy-Weinberg does not hold; with an inbreeding coefficient \(F\), the genotypes are distributed according to \(\{ q^{2} + pqF,2pq\left( 1 - F \right),p^{2} + pqF\}\) (Falconer and Mackay 1996). Then we multiply each squared value of \(z\) by its frequency:

\[\mathbf{z}_{i}^{'}\mathbf{z}_{i} = m\left\lbrack \left( - 2p_{i} \right)^{2}\left( q_{i}^{2} + p_{i}q_{i}F \right) + \left( 1 - 2p_{i} \right)^{2}2p_{i}q_{i}\left( 1 - F \right) + \left( 2 - 2p_{i} \right)^{2}\left( p_{i}^{2} + p_{i}q_{i}F \right) \right\rbrack = 2m(1 + F)p_{i}q_{i}\]

The diagonal of \(G\) has in this case an average of

\[\frac{1}{m}\text{tr}\left( \frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum p_{i}q_{i}} \right) = \frac{(1 + F)2m\sum p_{i}q_{i}}{2m\sum p_{i}q_{i}} = 1 + F\]

Note that \(F\) here is a within-population inbreeding, and can be negative, indicating excess of homozygosity (e.g., in an F1 population).

11.4.4.4 The average value of the off-diagonal of \(\mathbf{G}\) is almost 0

This is the case if both Hardy-Weinberg and linkage equilibrium hold. If there are \(m\) genotyped animals, since \(\text{sum}\left( \mathbf{G} \right) = 0\) and \(\text{tr}\left( \mathbf{G} \right) = m\), the average off-diagonal value is:

\[\text{avoff}\left( \mathbf{G} \right) = \frac{1}{m\left( m - 1 \right)}\left( \text{sum}\left( \mathbf{G} \right) - \text{tr}\left( \mathbf{G} \right) \right) = \frac{0 - m}{m\left( m - 1 \right)} = - \frac{1}{m - 1}\]

which is very close to zero.
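These properties of \(\mathbf{G}\) can be checked by simulation (a sketch assuming Hardy-Weinberg and linkage equilibrium; the population sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, nsnp = 200, 5000
p = rng.uniform(0.1, 0.9, nsnp)                       # allele frequencies
X = rng.binomial(2, p, size=(m, nsnp)).astype(float)  # HW + LE genotypes {0,1,2}
pobs = X.mean(axis=0) / 2                             # observed frequencies
Z = X - 2 * pobs                                      # centered coding
G = Z @ Z.T / (2 * np.sum(pobs * (1 - pobs)))

print(G.mean())            # 0 up to floating-point error (Z is centered)
print(np.diag(G).mean())   # close to 1 under Hardy-Weinberg
off = (G.sum() - np.trace(G)) / (m * (m - 1))
print(off)                 # small: equals -tr(G)/(m(m-1)) because sum(G) = 0
```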

11.4.5 Weighted Genomic relationships

We have seen that Bayesian Regressions are an option for genomic selection. In essence, they assume that different markers may have different variances. This can be implemented using

\[Var\left( \mathbf{u} \right) = \mathbf{Z}Var\left( \mathbf{a} \right)\mathbf{Z}^{\mathbf{'}}\mathbf{= ZD}\mathbf{Z}^{\mathbf{'}}\]

Alternatively, and mainly for ease of implementation (e.g., in BLUPF90 or AsReml) this can be obtained factorizing out the genetic variance and using a matrix of weights as in \(Var\left( \mathbf{u} \right) = \mathbf{Z}\mathbf{D}_{w}\mathbf{\ }\mathbf{Z}^{\mathbf{'}}\sigma_{u}^{2}\) with

\[\mathbf{D}_{w}\mathbf{=}\begin{pmatrix} \sigma_{a1}^{2}/\sigma_{a0}^{2} & 0 & \ldots & 0 \\ 0 & \sigma_{a2}^{2}/\sigma_{a0}^{2} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \sigma_{\text{an}}^{2}/\sigma_{a0}^{2} \\ \end{pmatrix} = \begin{pmatrix} w_{1} & 0 & \ldots & 0 \\ 0 & w_{2} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & w_{n} \\ \end{pmatrix}\]

Note that if \(w_{1} = w_{2} = \ldots = w_{n} = 1\) this is regular genomic relationships.

Marker variances or weights can be obtained in several ways. (Zhang et al. 2010) and (Andrés Legarra et al. 2011) suggested obtaining them from Bayesian Regressions, with good results. (Shen et al. 2013) suggested a REML-like strategy that we mentioned before, and (X. Sun et al. 2012) proposed a simple (but seriously biased) algorithm to get SNP-specific variances. Another option is to use VanRaden’s nonLinearA (P. M. VanRaden 2008) to obtain updates for \(\mathbf{D}\).
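A minimal sketch of how weighted genomic relationships can be computed; the weights here are made up for illustration, standing in for estimates from, e.g., Bayesian Regressions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, nsnp = 50, 500
p = rng.uniform(0.05, 0.95, nsnp)
X = rng.binomial(2, p, size=(m, nsnp)).astype(float)  # genotypes {0,1,2}
Z = X - X.mean(axis=0)                                # centered coding
s2pq = 2 * np.sum(p * (1 - p))

w = rng.gamma(shape=4.0, scale=0.25, size=nsnp)  # made-up weights, mean about 1
G = Z @ Z.T / s2pq                               # regular (unweighted) G
Gw = (Z * w) @ Z.T / s2pq                        # Z D_w Z', without forming D_w

# with all weights equal to 1, the weighted matrix is the regular G
assert np.allclose((Z * np.ones(nsnp)) @ Z.T / s2pq, G)
```

Multiplying the columns of \(\mathbf{Z}\) by the weights avoids building the diagonal matrix \(\mathbf{D}_{w}\) explicitly.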

11.5 Genomic relationships as estimators of realized relationships

The notion of actual or realized relationship is of utmost importance for genomic selection. Pedigree relationships assume an infinitesimal model with infinite unlinked genes. At one locus, two full-sibs may share one, two or no alleles. Across all loci, two full-sibs share exactly half their genome in the infinitesimal model. This is no longer true with real chromosomes: chromosomes tend to be transmitted together, and therefore two half-sibs may inherit very different genomic contributions, as shown in the Figure below. The paper of VanRaden (P. M. VanRaden 2007) makes a very good review of the subject.

Different transmission of one chromosome from sire to four half-sibs. Different maternal chromosomes are in black.

In this example, sons 1 and 3 are more alike than sons 2 and 4. Therefore, in prediction of son 3, son 1 should be given more weight than sons 2 and 4. Based on colors, one would say that the relationship of these four sons are something like

\[R = \begin{pmatrix} 1 & 0 & 0.5 & 0.1 \\ 0 & 1 & 0 & 0.4 \\ 0.5 & 0 & 1 & 0.1 \\ 0.1 & 0.4 & 0.1 & 1 \\ \end{pmatrix}\]

These “real” relationships are called realized relationships, as opposed to expected relationships. (W. G. Hill and Weir 2011) used the notation \(R_{\text{ij}}\) for the realized relationship, which we will follow. Expressions for the difference between expected \((A_{\text{ij}})\) and realized (\(R_{\text{ij}}\)) relationships were given by (P. M. VanRaden 2007 ; W. G. Hill and Weir 2011 ; Garcia-Cortes et al. 2013).

In theory, one can define realized relationships in the same way as regular relationships, assuming an unrelated base population, in which case they are identical by descent relationships. In this case,

\[E\left( R_{\text{ij}} \right) = A_{\text{ij}}\]

This important result means that if we simulate meiosis of chromosomes from the sire to the two half-sibs 1 and 2, at each simulation there will be a realized relationship between the two half sibs. This realized relationship will vary between 0 and 0.5, but on average across the simulations it will be 0.25, which is the value of \(A_{\text{ij}}\) .

These deviations are skewed, and the ratio deviation/expectation is high for lowly related animals. This means that two third-degree cousins may actually not share any allele. Markers can detect these differences. (Luan et al. 2012) suggested obtaining realized relationships from a pure identity-by-descent approach, based on computing probabilities of transmission from parents to offspring with the help of pedigree and markers (R. L. Fernando and Grossman 1989 ; T. Meuwissen and Goddard 2010), which assumes that founders of the pedigree are unrelated. This has two drawbacks. The first one is that major genes are ignored (because closely associated markers will be ignored). The second one is that computing becomes rather difficult when genotyped animals do not form a complete pedigree (T. Meuwissen and Goddard 2010).

However, Cockerham’s result \(\text{Cov}\left( z_{i},z_{j} \right) = R_{\text{ij}}2\text{pq}\) actually involves realized relationships. Then, we can reverse the formula and estimate those realized relationships as \({\widehat{R}}_{\text{ij}} = \text{Cov}\left( z_{i},z_{j} \right)/2\text{pq}\). For instance, consider three individuals and 20 markers, with genotypes coded as \(\{ 0,1,2\}\):

1 1 2 2 1 0 0 1 0 0 2 0 0 2 0 2 2 0 1 1

0 1 2 1 2 1 0 1 2 2 2 2 0 2 1 0 1 0 0 1

2 0 2 0 0 2 1 0 0 0 1 1 0 2 2 1 0 0 0 1

The covariance \(\text{Cov}\left( z_{i},z_{j} \right)\) of individuals 1 and 2 is \(\text{Cov}\left( \mathbf{z}_{1},\mathbf{z}_{2} \right) = 0.11\). This covariance does not depend on gene coding or allele frequencies. But now, by what \(2\text{pq}\) do we divide? We may use the “average” \(2\text{pq}\), in other words, \(\frac{2}{m}\sum p_{i}q_{i}\) where \(m\) is the number of markers. Imagine that the frequencies are

0.7 0.31 0.54 0.72 0.83 0.95 0.98 0.84 0.75 0.59 0.37

0.93 0.37 0.79 0.32 0.27 0.14 0.53 0.58 0.78

Then \(\frac{2}{m}\sum p_{i}q_{i} =\) 0.354, and therefore

\[{\widehat{R}}_{\text{ij}} = \frac{\text{Cov}\left( \mathbf{z}_{1},\mathbf{z}_{2} \right)}{\frac{2}{m}\sum p_{i}q_{i}} = \frac{0.11}{0.354} = 0.31\]

The animals are related (close to “cousins”).
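This worked example can be reproduced numerically (a sketch; the genotypes and frequencies are those listed above):

```python
import numpy as np

x1 = np.array([1,1,2,2,1,0,0,1,0,0,2,0,0,2,0,2,2,0,1,1], dtype=float)
x2 = np.array([0,1,2,1,2,1,0,1,2,2,2,2,0,2,1,0,1,0,0,1], dtype=float)
p = np.array([0.7,0.31,0.54,0.72,0.83,0.95,0.98,0.84,0.75,0.59,
              0.37,0.93,0.37,0.79,0.32,0.27,0.14,0.53,0.58,0.78])

cov = np.cov(x1, x2)[0, 1]         # covariance across the 20 markers (means removed)
avg2pq = 2 * np.mean(p * (1 - p))  # average 2pq
R12 = cov / avg2pq
print(round(cov, 2), round(avg2pq, 3), round(R12, 2))  # 0.11 0.354 0.31
```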

We have just reinvented the wheel. VanRaden’s first \(\mathbf{G}\) is:

\[\mathbf{G =}\frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum p_{k}q_{k}}\]

For two individuals, this is

\[\frac{\mathbf{z}_{i}^{'}\mathbf{z}_{j}}{2\sum p_{k}q_{k}} = \frac{\sum z_{\text{ik}}z_{\text{jk}}}{2\sum p_{k}q_{k}}\]

But, when using “centered” coding, \(\sum z_{\text{ik}}z_{\text{jk}}\) is simply \(\sum z_{\text{ik}}z_{\text{jk}} = m\,\text{Cov}\left( \mathbf{z}_{i},\mathbf{z}_{j} \right)\). Thus, VanRaden’s genomic relationship matrix is an estimator of realized relationships, one that uses markers to infer them. The duality of VanRaden’s formulation is that it refers at the same time to marker effects and to relationships.

If genomic relationships \(G_{\text{ij}}\) are an unbiased estimator of realized relationships \(R_{\text{ij}}\), then

\[E\left( \mathbf{G} \right) = \mathbf{R}\]

But also, realized relationships are deviations from expected pedigree relationships, and we have that

\[E\left( \mathbf{R} \right) = \mathbf{A}\]

Therefore

\[E\left( \mathbf{G} \right) = \mathbf{A}\]

This raises another question. If realized relationships \(R_{\text{ij}}\) can be defined as IBD relationships, then one should not get negative values. Does this mean that we should turn negative values in \(\mathbf{G}\) to zero? The answer is NO. For individuals that are suspected to have 0 relationships, (\(A_{\text{ij}} = 0)\), this means that \(G_{\text{ij}}\) can oscillate between positive and negative values. However, if we don’t use base allelic frequencies, then \(\mathbf{G}\) is biased with respect to \(\mathbf{A}\) and underestimates relationships.

All these ideas (and many more) were described in depth by (M. Á. Toro, García-Cortés, and Legarra 2011 ; E. A. Thompson 2013).

11.5.1 Other estimators of (genomic) relationships

We can construct a matrix of \(r_{\text{Mij}}\), relationships based on IBS coefficients or “coefficients of similarity” \(\mathbf{G}_{\text{IBS}}\) . The terms in \(\mathbf{G}_{\text{IBS}}\) are usually described in terms of identities or countings:

\(\mathbf{G}_{\text{IBS}_{\text{ij}}} = r_{\text{Mij}} = \frac{1}{n}\sum_{s = 1}^{n}{2\ \frac{\sum_{k = 1}^{2}{\sum_{l = 1}^{2}I_{\text{kl}}}}{4}}\),

where, at each locus \(s\), \(I_{\text{kl}}\) measures the identity (with value 1 or 0) of allele \(k\) in individual \(i\) with allele \(l\) in individual \(j\), and these single-locus identity measures are averaged across the \(n\) loci. It has a nice feature: elements in \(\mathbf{G}_{\text{IBS}}\) are probabilities (contrary to other \(\mathbf{G}\)’s).

In the conservation genetics literature, these are usually called molecular relationships (\(r_{\text{Mij}}\)).

11.5.1.1 Corrected IBS

In the conservation genetics literature, a common technique is to use molecular relationships (\(r_{\text{Mij}}\)) corrected by allelic frequencies, using one of the previous results:

\[{\widehat{R}}_{\text{ij}} = \frac{r_{\text{Mij}} - 2p^{2} - 2q^{2}}{2\text{pq}}\]

There are many variants of this expression (Lynch 1988 ; M. Á. Toro, García-Cortés, and Legarra 2011 ; Ritland 1996) . Extended to several markers (M. Á. Toro, García-Cortés, and Legarra 2011):

\[{\widehat{R}}_{\text{ij}} = 2\frac{\frac{r_{\text{Mij}}}{2} - {\overline{p}}^{2} - {\overline{q}}^{2} - 2Var(p)}{2\left( \overline{p}\overline{q} - Var(p) \right)}\]

How do we compute molecular relationships? Consider the following table that compares two genotypes at a time:

Molecular coancestry (bold) and molecular relationship \(r_{\text{Mij}}\) (italic) comparing two genotypes

|        | AA          | Aa          | aa          |
|--------|-------------|-------------|-------------|
| **AA** | **1** *2*   | **0.5** *1* | **0** *0*   |
| **Aa** | **0.5** *1* | **0.5** *1* | **0.5** *1* |
| **aa** | **0** *0*   | **0.5** *1* | **1** *2*   |

The bold values in the table give molecular coancestries – the probability that one allele sampled at random from one individual is identical by state to one allele drawn at random from the other individual. For instance, two individuals Aa and Aa have a probability of ½ that, if we draw one allele from each, the alleles will be identical. Multiplying these coancestries by 2 gives the molecular relationships \(r_{\text{Mij}}\) (in italics).

There is a much faster way to get molecular relationships \(r_{\text{Mij}}\), based on the single-locus identity \(r_{\text{Mij}} = z_{i}z_{j} - z_{i}{- z}_{j} + 2\), where \(z_{i},\ z_{j}\) are genotypes coded as \(\{ 0,1,2\}\), averaged over loci. Then it can be shown that the whole array of \(r_{\text{Mij}}\) can be computed at once as a matrix \(\mathbf{G}_{\text{IBS}}\) using (Garcia-Baccino et al. 2017)

\[\left\{ r_{\text{Mij}} \right\} = \mathbf{G}_{\text{IBS}} = \frac{1}{m}\left( \mathbf{Z}_{101}\mathbf{Z}_{101}^{\mathbf{'}} \right) + \mathbf{11'}\text{\ \ }\]

as a crossproduct of matrices \(\mathbf{Z}_{101}\) (by which we mean genotypes coded as \(\{ - 1,0,1\}\)), and where \(\mathbf{11}\mathbf{'}\) is a matrix of 1’s. Using the same example as before

1 1 2 2 1 0 0 1 0 0 2 0 0 2 0 2 2 0 1 1

0 1 2 1 2 1 0 1 2 2 2 2 0 2 1 0 1 0 0 1

2 0 2 0 0 2 1 0 0 0 1 1 0 2 2 1 0 0 0 1

with frequencies:

0.7 0.31 0.54 0.72 0.83 0.95 0.98 0.84 0.75 0.59 0.37

0.93 0.37 0.79 0.32 0.27 0.14 0.53 0.58 0.78

this yields \(r_{\text{Mij}} = 1.1\) and \({\widehat{R}}_{\text{ij}} = 2\frac{\frac{r_{\text{Mij}}}{2} - {\overline{p}}^{2} - {\overline{q}}^{2} - 2Var(p)}{2\left( \overline{p}\overline{q} - Var(p) \right)} = - 0.50\)

Quite different from the other estimator (you may check the number). However, if many markers are used, all estimators tend to be very similar.
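The fast formula for \(\mathbf{G}_{\text{IBS}}\) can be checked on the same three individuals (a sketch; data as listed above):

```python
import numpy as np

X = np.array([[1,1,2,2,1,0,0,1,0,0,2,0,0,2,0,2,2,0,1,1],
              [0,1,2,1,2,1,0,1,2,2,2,2,0,2,1,0,1,0,0,1],
              [2,0,2,0,0,2,1,0,0,0,1,1,0,2,2,1,0,0,0,1]], dtype=float)
m = X.shape[1]
Z101 = X - 1.0                    # recode genotypes {0,1,2} to {-1,0,1}
G_ibs = Z101 @ Z101.T / m + 1.0   # the 11' term adds 1 to every element
print(G_ibs[0, 1])                # 1.1, the molecular relationship in the text
```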

Values of \({\widehat{R}}_{\text{ij}}\) can also be negative, and some set these values to zero. This is a gross mistake, first for the arguments above and second because it greatly compromises numerical computations (\({\widehat{R}}_{\text{ij}}\) values corrected like that do not form a positive definite covariance matrix).

11.5.1.2 VanRaden with 0.5 allele frequencies

One option is to pretend that all frequencies are \(p_{i} = 0.5\). Then VanRaden’s G is constructed using \(\mathbf{Z}_{101}\) (coded as \(\{ - 1,0,1\}\)) and dividing by \(2\sum p_{i}q_{i} = m/2\) with m the number of markers:

\[\mathbf{G}_{05} = \frac{\mathbf{Z}_{101}\mathbf{Z}_{101}^{\mathbf{'}}}{m/2}\mathbf{=}2\frac{\mathbf{Z}_{101}\mathbf{Z}_{101}^{\mathbf{'}}}{m}\]

In turn, (Garcia-Baccino et al. 2017) proved that

\[\mathbf{G}_{\text{IBS}} = \frac{1}{2}\mathbf{G}_{05} + \mathbf{11}^{'}\]

So, the \(\mathbf{G}_{\text{IBS}}\) is basically a particular case of VanRaden’s \(\mathbf{G}\).
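This identity is easy to verify numerically for arbitrary genotypes (a sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(30, 400)).astype(float)  # arbitrary {0,1,2} genotypes
m = X.shape[1]
Z101 = X - 1.0
G_ibs = Z101 @ Z101.T / m + 1.0   # molecular relationships
G05 = 2.0 * (Z101 @ Z101.T) / m   # VanRaden's G with all frequencies set to 0.5
assert np.allclose(G_ibs, 0.5 * G05 + 1.0)   # G_IBS = (1/2) G_05 + 11'
```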

11.5.2 Genomic inbreeding

From all the \(\mathbf{G}\)’s above, we obtain estimators of genomic inbreeding. For individual \(i\), genomic inbreeding can be defined as \(G_{\text{ii}} - 1\), and it describes its homozygosity with respect to the assumed allele frequencies. The genomic inbreeding has a few funny properties.

11.6 Compatibility of genomic and pedigree relationships

VanRaden’s \(\mathbf{G}\) is dependent on the use of base allele frequencies. For some populations where old ancestors are genotyped (e.g., some populations of dairy cattle), obtaining these is feasible. However, this is not the case in many populations. For instance, the Lacaune dairy sheep started recording pedigree and data in the 60’s, while DNA has been stored only since the 90’s. This causes two problems (that are also problems for Bayesian Regressions):

  1. The genetic base is no longer the same for pedigree and markers. We have seen that, by construction, using “centered” coding imposes an average \(\overline{\mathbf{u}}\mathbf{=}0\) across the genotyped population. This contradicts the pedigree, which imposes \(\overline{\mathbf{u}}\mathbf{=}0\) only across the founders of the pedigree.

For instance, when comparing pedigree-based EBV’s and genomic-based EBV’s, there will be a shift in scale. This shift can be accounted for by selecting a group of animals and referring all EBV’s to their average EBV in both cases. Remember that the result of (Ismo Strandén and Christensen 2011) guarantees that there will only be a shift in estimates of \(\mathbf{u}\), while the differences across breeding values will be identical.

  2. The genetic variance changes. The pedigree-based genetic variance \(\sigma_{u}^{2}\) refers to the variance of the breeding values of the founders of the pedigree. The marker-based genetic variance \(2\sum p_{i}q_{i}\ \sigma_{a0}^{2}\) refers to the variance of a population with allele frequencies \(p_{i}\). These are typically “current” observed allele frequencies. However, over the generations of a pedigree, marker alleles tend to fix by drift and selection, and therefore \(2\sum p_{i}q_{i}\ \sigma_{a0}^{2}\) is lower using current frequencies than using base allele frequencies.

Equating \(\sigma_{a0}^{2} = \sigma_{u}^{2}/2\sum p_{i}q_{i}\text{\ \ }\)will tend to underestimate \(\sigma_{a0}^{2}\) . This can be solved if instead of using this expression to obtain \(\sigma_{a0}^{2}\), one estimates \(\sigma_{a0}^{2}\) or marker variances directly, as in BayesC, Bayesian Lasso, or GREML (see later).

These problems are only relevant if one tries to combine pedigree-based information and genomic-based information. In the following, we will use the following notation. \(\mathbf{u}_{\text{base}}\) are the animals of the genetic base of the pedigree (i.e., the founders). \(\mathbf{u}_{2}\) are genotyped animals, and \(\mathbf{u}_{1}\) are ungenotyped animals.

11.6.1 Use of Gengler’s method

Gengler’s method can be used to estimate base allele frequencies (N. Gengler, Mayeres, and Szydlowski 2007 ; P. M. VanRaden 2008). It has, however, been rarely used; one of the reasons is that estimates may go out of bounds (e.g., allele frequencies beyond 0 or 1).

11.6.2 Compatibility of genetic bases

This is detailed in (Z. Vitezica et al. 2011). If base allele frequencies are not available, one may use current allele frequencies (i.e., frequencies in the genotypes of \(\mathbf{u}_{2}\)). We know that, by construction of \(\mathbf{G}\), the mean of \(\mathbf{u}_{2}\) is set to zero: \(p\left( \mathbf{u}_{2} \right) = N(0,\mathbf{G}\sigma_{u}^{2})\). The difference of both means can be modelled as random: \(\mu = {\overline{\mathbf{u}}}_{2} - {\overline{\mathbf{u}}}_{\text{base}} = {\overline{\mathbf{u}}}_{2} = \frac{1}{m}\mathbf{1}^{'}\mathbf{u}_{2}\), where \(m\) is the number of individuals in \(\mathbf{u}_{2}\).

In an infinite population with no selection, there would be no difference between \({\overline{\mathbf{u}}}_{2}\) and \({\overline{\mathbf{u}}}_{\text{base}}\). However, in a finite population there is selection, drift, or both. In this case we can model \(\mathbf{u}_{2}\) as having an a priori mean: \(p\left( \mathbf{u}_{2}|\mu \right) = N(\mu,\mathbf{G}\sigma_{u}^{2})\). This mean is actually the result of random factors (selection and drift) and therefore is a random variable with some variance \(\sigma_{\mu}^{2} = a\sigma_{u}^{2}\) (\(a\) was called \(\alpha\) in (Z. Vitezica et al. 2011)). Integrating out this mean from the expression \(p\left( \mathbf{u}_{2}|\mu \right)p(\mu) = N(\mu,\mathbf{G}\sigma_{u}^{2})N(0,\sigma_{\mu}^{2})\) we have that

\(p\left( \mathbf{u}_{2} \right) = N\left( 0,\mathbf{G}^{*}\sigma_{u}^{2} \right)\)

where \(\mathbf{G}^{*} = \mathbf{G} + \mathbf{1}\mathbf{1}^{\mathbf{'}}a\) is a “tuned” genomic relationship matrix which takes into account our ignorance as to the difference between pedigree and genomic genetic bases. The \(\mathbf{11}\mathbf{'}\) operator simply adds the constant \(a\) to every element of \(\mathbf{G}\). Informally, we may write \(\mathbf{G}^{*} = a + \mathbf{G}\).

To obtain a value for \(\sigma_{\mu}^{2}\), we know based on pedigree that the \(Var\left( \mathbf{u}_{2} \right) = \mathbf{A}_{22}\sigma_{u}^{2}\). Therefore \(Var\left( \frac{1}{m}\mathbf{1}^{'}\mathbf{u}_{2} \right) = \frac{1}{m^{2}}\left( \mathbf{1}^{\mathbf{'}}\mathbf{A}_{22}\mathbf{1}\sigma_{u}^{2}\mathbf{\ } \right) = {\overline{\mathbf{A}}}_{22}\sigma_{u}^{2}\text{\ \ }\), where \(\mathbf{A}_{22}\) is the pedigree relationship matrix and the bar means “average over values of \(\mathbf{A}_{22}\)”. Based on genomics, this variance would be \(Var\left( \frac{1}{m}\mathbf{1}^{'}\mathbf{u}_{2} \right) = \frac{1}{m^{2}}\left( \mathbf{1}^{\mathbf{'}}\mathbf{G}\mathbf{1} + \mathbf{1}^{\mathbf{'}}\mathbf{1}\mathbf{1}^{\mathbf{'}}\mathbf{1}a \right)\sigma_{u}^{2} = \left( \overline{\mathbf{G}}\mathbf{+}a \right)\sigma_{u}^{2}\). If we equate both variances, we have that

\[a = {\overline{\mathbf{A}}}_{22}\mathbf{-}\overline{\mathbf{G}}\]

It can be noted that in Hardy-Weinberg equilibrium, \(\overline{\mathbf{G}}\mathbf{=}0\) and \(a = {\overline{\mathbf{A}}}_{22}\).

Adding the constant \(a\) as in \(\mathbf{G}^{\mathbf{*}}\mathbf{=}\mathbf{G} + \mathbf{1}\mathbf{1}^{\mathbf{'}}a\) ensures, by construction, that both evaluations are on the same scale. This way of getting a value for \(a\) is called the method of moments, and it guarantees unbiasedness. The genetic interpretation is simple. Constructing \(\mathbf{G}\) with current allele frequencies underestimates relationships in the base population. We estimate this underestimation from the average difference between \(\mathbf{G}\) and \(\mathbf{A}_{22}\). Adding a constant to every element of \(\mathbf{G}\) ensures that genomic relationships are, on average, on the same genetic base as pedigree relationships.
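A minimal sketch of this method-of-moments tuning (the matrices below are toy values, not real data):

```python
import numpy as np

def tune_base(G, A22):
    """Return (G + a, a) with a = mean(A22) - mean(G), a added to every element."""
    a = A22.mean() - G.mean()
    return G + a, a

G = np.array([[1.00, 0.05],
              [0.05, 0.98]])    # genomic relationships (current frequencies)
A22 = np.array([[1.02, 0.20],
                [0.20, 1.04]])  # pedigree relationships of genotyped animals
Gstar, a = tune_base(G, A22)
assert np.isclose(Gstar.mean(), A22.mean())  # averages now match
```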

11.6.3 Compatibility of genetic variances

In VanRaden’s formulation of \(\mathbf{G}\mathbf{=}\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\mathbf{/}2\sum p_{i}q_{i}\), the divisor comes from the assumption that the genetic variance is \(\sigma_{u}^{2} = 2\sum p_{i}q_{i}\sigma_{a0}^{2}\). However, the product \(2\sum p_{i}q_{i}\) will be too low if we use current allele frequencies instead of base allele frequencies. Therefore, we seek an adjustment

\[\mathbf{G}^{*} = b\mathbf{G}\]

where \(b\) accounts for the ratio of “current” \(2\sum p_{i}q_{i}\) to “base” \(2\sum p_{i}q_{i}\) and is typically lower than 1 (i.e., the genetic variance has reduced).

The reasoning to solve this issue is as follows. Consider the genetic variance of the genotyped individuals in \(\mathbf{u}_{2}\) ; I will call this \(S_{u2}^{2}\) to stress that this is a variance of a particular population, not the variance of the genetic base. This is \(S_{u2}^{2} = \frac{1}{m}\mathbf{u}_{2}^{'}\mathbf{u}_{2} - {\overline{\mathbf{u}}}_{2}^{2}\) . This \(S_{u2}^{2}\) has a certain distribution under either pedigree or genomic modeling. As we did with genetic bases, we will equate, on expectation, the two \(S_{u2}^{2}\) .

Under pedigree relationships we have that (Searle 1982) p. 355:

\[E\left( S_{u2}^{2} \right) = \left( \frac{1}{m}\text{tr}\left( \mathbf{A}_{22} \right) - {\overline{\mathbf{A}}}_{22} \right)\sigma_{u}^{2} = \left( 1 + {\overline{F}}_{p} - {\overline{\mathbf{A}}}_{22} \right)\sigma_{u}^{2}\]

Under genomic relationships we have that:

\[E\left( S_{u2}^{2} \right) = \left( \frac{1}{m}\text{tr}\left( b\mathbf{G} \right) - b\overline{\mathbf{G}} \right)\sigma_{u}^{2} = b\left( 1 + {\overline{F}}_{g} - \overline{\mathbf{G}} \right)\sigma_{u}^{2}\]

where \({\overline{F}}_{p}\) is average pedigree inbreeding and \({\overline{F}}_{g}\) is average genomic inbreeding. Equating both expectations we have that

\[b = \frac{\left( 1 + {\overline{F}}_{p} - {\overline{\mathbf{A}}}_{22} \right)}{\left( 1 + {\overline{F}}_{g} - \overline{\mathbf{G}} \right)}\]

A similar result was shown by (Forni, Aguilar, and Misztal 2011), who included genomic inbreeding. Under Hardy-Weinberg conditions, we have seen that \(\overline{\mathbf{G}} = 0\) and \({\overline{F}}_{g} = 0\) (the average diagonal is 1). On the other hand, if matings are at random, \({\overline{F}}_{p} = {\overline{\mathbf{A}}}_{22}/2\). Therefore:

\[b = 1 - {\overline{F}}_{p}\]

In that case, \(b = 1 - a/2\), with \(a\) as above, which results in \(b < 1\). This means that the genetic variance was reduced from the pedigree base to the genotyped population, to a fraction \(1 - {\overline{F}}_{p}\) of its original value (as predicted by theory). Thus, the multiplication by \(b\) corrects for the fixation of alleles due to inbreeding.

11.6.4 Compatibility of genetic bases and variances

With the two pieces above, it is easy to see that a compatible matrix \(\mathbf{G}^{*} = a + b\mathbf{G}\) can be obtained with the expressions above for \(a\) and \(b\). (Z. Vitezica et al. 2011), based on (Powell, Visscher, and Goddard 2010), observed that relationships in a “recent” population expressed on the scale of an “old” population can be modelled using Wright’s fixation indexes. Translated to our context, this gives \(a = {\overline{\mathbf{A}}}_{22}\) and \(b = 1 - \frac{a}{2}\), which is the same result as above if Hardy-Weinberg holds.

Christensen et al. (2012) remarked that the hypothesis of a randomly mating population is unlikely for the group of genotyped animals, since they would be born in different years, with some being descendants of others, and suggested inferring \(a\) and \(b\) from the system of two equations equating average relationships and average inbreeding: \(\frac{\text{tr}\left( \mathbf{G} \right)}{m}b + a = \frac{\text{tr}\left( \mathbf{A}_{22} \right)}{m}\) and \(a + b\overline{\mathbf{G}}\mathbf{=}{\overline{\mathbf{A}}}_{22}\). This is basically the same development as above. They further noticed that in practice \(b \approx 1 - a/2\), because the deviation from Hardy-Weinberg was small.
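These two equations are linear in \(a\) and \(b\) and can be solved directly (a sketch with a toy matrix; when \(\mathbf{G}\) equals \(\mathbf{A}_{22}\) the solution must be \(a = 0\), \(b = 1\)):

```python
import numpy as np

def fit_a_b(G, A22):
    """Solve a + b*tr(G)/m = tr(A22)/m and a + b*mean(G) = mean(A22) for a, b."""
    m = G.shape[0]
    M = np.array([[1.0, np.trace(G) / m],
                  [1.0, G.mean()]])
    rhs = np.array([np.trace(A22) / m, A22.mean()])
    a, b = np.linalg.solve(M, rhs)
    return a, b

rng = np.random.default_rng(3)
L = rng.normal(size=(5, 5))
A22 = L @ L.T / 5 + np.eye(5)   # toy positive definite "pedigree" matrix
a, b = fit_a_b(A22, A22)        # sanity check: G = A22 gives a = 0, b = 1
assert abs(a) < 1e-9 and abs(b - 1.0) < 1e-9
```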

VanRaden (2008) suggested a regression of observed on expected relationships, minimizing the residuals of \(a + b\mathbf{G}\mathbf{=}\mathbf{A}_{22} + \mathbf{E}\). This idea was generalized to several breed origins by (Harris and Johnson 2010). The distribution of \(\mathbf{E}\) is not homoscedastic, and this has precluded scholars from trying this approach, because it would be sensitive to extreme values (O. Christensen et al. 2012), e.g., if many distant relatives are included, for which the deviations in \(\mathbf{E}\) can be very large.

Finally, (O. Christensen et al. 2012) argued that relationships in \(\mathbf{G}\) do not depend on pedigree depth, and are exact in some sense. They suggested taking as reference the 101 coding (i.e., setting the frequencies to 0.5) and then “tuning” pedigree relationships in \(\mathbf{A}\) to match genomic relationships in \(\mathbf{G}\). They introduced two extra parameters, \(\gamma\) and \(s\). The \(\gamma\) parameter can be understood as the overall relationship across the base population such that current genotypes are most likely, and integrates the fact that the assumption of unrelatedness in the base population is false in view of genomic results (two animals who share alleles at markers are related even if the pedigree is not informative). More precisely, they devised a new pedigree relationship matrix, \(\mathbf{A}\left( \gamma \right)\), whose founders have relationship matrix \(\mathbf{A}_{\text{base}} = \gamma + \mathbf{I}(1 - \gamma/2)\). Parameter \(s\), used in \(\mathbf{G}\mathbf{=}\mathbf{Z}\mathbf{Z}^{\mathbf{'}}/s\), can be understood as the counterpart of \(2\Sigma p_{i}q_{i}\) (heterozygosity of the markers) in the base generation. Both parameters can be deduced by maximum likelihood. This model is the only one which accounts for all the complexities of pedigrees (the former ones are based on average relationships), but it has not been tested with real data so far.

11.7 Singularity of G

Matrix \(\mathbf{G}\) might be (and usually is) singular. There are two reasons for this. First, if there are clones or identical twins, two genotypes in \(\mathbf{Z}\) will be identical and therefore two animals will show a correlation of exactly 1 in \(\mathbf{G}\). Second, if genotypes in \(\mathbf{Z}\) use “centered” coding with observed allele frequencies, then the matrix is singular (the last row can be predicted from the other ones) (Ismo Strandén and Christensen 2011).

To obtain an invertible \(\mathbf{G}\) and then use \(\mathbf{G}^{- 1}\) in the mixed model equations, there are two ways. The first one is to use a modified \(\mathbf{G}_{w}\mathbf{=}\left( 1 - \alpha \right)\mathbf{G} + \alpha\mathbf{I}\), with \(\alpha\) a small value (typically 0.05 or 0.01). The second option consists of mixing genomic and pedigree relationships. If \(\mathbf{A}_{22}\) is the pedigree relationship matrix of the genotyped animals, we might use a modified “weighted” \(\mathbf{G}_{w}\mathbf{=}\left( 1 - \alpha \right)\mathbf{G} + \alpha\mathbf{A}_{22}\). This is the default in the Blupf90 package, which uses \(\alpha = 0.05\). A more detailed explanation is in the next section.

11.8 Including residual polygenic effects in G

One may consider that not all genetic variance is captured by markers. This can be shown by estimating variance assigned to markers and pedigree (Andrés Legarra et al. 2008 ; Rodríguez-Ramilo, García-Cortés, and González-Recio 2014 ; Jensen, Su, and Madsen 2012 ; O. F. Christensen and Lund 2010) or because some genomic evaluation procedures give better cross-validation results when an extra polygenic term based exclusively on pedigree relationships is added (e.g. (Su, Madsen, et al. 2012)).

Let us decompose the breeding values of genotyped individuals in a part due to markers and a residual part due to pedigree, \(\mathbf{u}\mathbf{=}\mathbf{u}_{m}\mathbf{+}\mathbf{u}_{p}\) with respective variances \(\sigma_{u}^{2} = \sigma_{u,m}^{2} + \sigma_{u,p}^{2}\).

It follows that \(Var\left( \mathbf{u}_{2} \right) = \left( \left( 1 - \alpha \right)\mathbf{G}\mathbf{+}{\alpha\mathbf{A}}_{22} \right)\sigma_{u}^{2}\) where \(\alpha = \sigma_{u,p}^{2}/\sigma_{u}^{2}\) is the ratio of pedigree-based variance to total variance.

Therefore, the simplest way to include the residual polygenic effects is to create a modified genomic relationship matrix \(\mathbf{G}_{w}\) (\(\mathbf{G}\) in (I. Aguilar et al. 2010); \(\mathbf{G}_{w}\) in (P. M. VanRaden 2008 ; O. Christensen et al. 2012) as \(\mathbf{G}_{w} = \left( 1 - \alpha \right)\mathbf{G}\mathbf{+}\alpha\mathbf{A}_{22}\). In practice, the value of \(\alpha\) is low and has negligible effects on predictions.

11.9 Multiallelic genomic relationships

In population genetics there are several methods to estimate (pedigree) relationship matrices through markers; these methods were proposed basically in conservation genetics (Ritland 1996; Caballero and Toro 2002). These methods are not very satisfying because they need parameters such as base population allele frequencies that are elusive.

Thus we would be happy if we could extend VanRaden’s \(\mathbf{G}\) to the multiallelic case. We (Marchal et al. 2016) developed it as an extension of the multiple marker regression model.

Imagine that each allele at each locus produces an effect. We saw such a model in section 9.

12 GBLUP

12.1 Single trait animal model GBLUP

With genomic relationships well defined in the previous section as (rather generally) \(Var\left( \mathbf{u} \right) = \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} = \mathbf{Z}\mathbf{D}_{w}\mathbf{Z}^{\mathbf{'}}\sigma_{u}^{2}\mathbf{=}\mathbf{G}\sigma_{u}^{2}\) (and perhaps after some compatibility “tuning” as before), the construction of genomic predictions in GBLUP form is straightforward. We have the following linear model:

\[\mathbf{y = Xb + Wu + e}\]

where \(\mathbf{W}\) is a matrix linking phenotypes to individuals. Then \(Var\left( \mathbf{u} \right) = \mathbf{G}\sigma_{u}^{2}\), \(Var\left( \mathbf{e} \right) = \mathbf{R}\). We may also assume multivariate normality. Under these assumptions, Best Predictions, or Conditional Expectations, of breeding values in \(\mathbf{u}\) can be obtained by Henderson’s mixed model equations as:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{W} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{X} & \mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{\ W +}\mathbf{G}^{- 1}\sigma_{u}^{- 2}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

If \(\mathbf{R} = \mathbf{I}\sigma_{e}^{2}\), then the variance components can be factored out and the equations become:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{W} \\ \mathbf{W}^{\mathbf{'}}\mathbf{X} & \mathbf{W'W + I}\lambda \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{y} \\ \mathbf{W}^{\mathbf{'}}\mathbf{y} \\ \end{pmatrix}\]

with \(\lambda = \sigma_{e}^{2}/\sigma_{u}^{2}\) .

These equations are identical to the regular animal model equations, except that genomic relationships \(\mathbf{G}\) are used instead of pedigree relationships. They have some very nice features:

  1. Any model that has been developed in BLUP can be immediately translated into GBLUP. This includes maternal effects models, random regression, competition effects models, multiple-trait models, etc.

  2. All genotyped individuals can be included, whether they have a phenotype or not. The only difference is that, for an individual without phenotype, the corresponding element in \(\mathbf{W}\) is set to 0.

  3. Regular software (blupf90, asreml, wombat…) works if we include a mechanism to include \(\mathbf{G}^{- 1}\).

  4. Developments based on mixed model equations apply to GBLUP as well. Therefore, GREML and G-Gibbs are simple extensions.
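As a minimal numerical sketch of the single-trait equations above (simulated data; the fixed effects reduce to an overall mean, each animal has one record, and a small blend keeps \(\mathbf{G}\) invertible):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 10, 80
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = M.mean(axis=0) / 2
Z = M - 2 * p
G = Z @ Z.T / (2 * np.sum(p * (1 - p)))
G = 0.99 * G + 0.01 * np.eye(n)        # small blend so that G is invertible

sigma2_u, sigma2_e = 0.3, 0.7
lam = sigma2_e / sigma2_u              # lambda = sigma2_e / sigma2_u

X = np.ones((n, 1))                    # fixed effects: overall mean only
W = np.eye(n)                          # one record per animal
y = rng.normal(size=n)                 # toy phenotypes

# mixed model equations with variances factored out (R = I sigma2_e)
LHS = np.block([[X.T @ X, X.T @ W],
                [W.T @ X, W.T @ W + np.linalg.inv(G) * lam]])
RHS = np.concatenate([X.T @ y, W.T @ y])
sol = np.linalg.solve(LHS, RHS)
b_hat, u_hat = sol[0], sol[1:]         # estimated mean and GEBVs
```

The solutions agree with the equivalent generalized least squares / selection index computation, which is a quick way to check an implementation.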

12.2 Multiple trait GBLUP

This is straightforward as well. The multiple trait mixed model equations are:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{W} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{X} & \mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{\ W +}\mathbf{G}^{- 1}\bigotimes\mathbf{G}_{0}^{- 1}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

where \(\mathbf{G}_{0}\) is the matrix of genetic covariance across traits, and usually \(\mathbf{R}\mathbf{=}\mathbf{I}\bigotimes\mathbf{R}_{0}\), where \(\mathbf{R}_{0}\) is the matrix of residual covariances. Note that these equations work perfectly well with missing traits.
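A minimal sketch of the Kronecker structures above, with toy covariance matrices (all numeric values invented for illustration):

```python
import numpy as np

# toy 2-trait covariances and a 3x3 genomic relationship matrix (assumed invertible)
G0 = np.array([[1.0, 0.4],
               [0.4, 2.0]])          # genetic covariances across traits
R0 = np.array([[1.0, 0.1],
               [0.1, 1.5]])          # residual covariances across traits
G  = np.array([[1.00, 0.50, 0.25],
               [0.50, 1.00, 0.50],
               [0.25, 0.50, 1.00]])

# Kronecker structures used in the multiple-trait MME (u sorted traits-within-animal)
Guu = np.kron(np.linalg.inv(G), np.linalg.inv(G0))   # G^-1 (x) G0^-1
R   = np.kron(np.eye(G.shape[0]), R0)                # R = I (x) R0

# the Kronecker product of the inverses equals the inverse of the Kronecker product
print(np.allclose(Guu, np.linalg.inv(np.kron(G, G0))))   # True
```

In practice one never inverts the large Kronecker product directly; the identity \(\left( \mathbf{G} \otimes \mathbf{G}_0 \right)^{-1} = \mathbf{G}^{-1} \otimes \mathbf{G}_0^{-1}\) is what makes the multiple-trait equations cheap to set up.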

12.3 Reliabilities from GBLUP

Nominal, also called model-based, reliabilities (NOT cross-validation reliabilities) can be obtained from the Mixed Model equations, as:

\[Rel_{i} = 1 - \frac{C^{\text{ii}}}{G_{\text{ii}}\sigma_{u}^{2}}\]

where \(C^{\text{ii}}\) is the \(i,i\) element of the inverse of the mixed model equations in their first form (i.e., with explicit \(\sigma_{u}^{2}\)). However, a word of caution is needed. Depending on how the coding of \(\mathbf{Z}\) proceeds, the numerical values of \(Rel_{i}\) change, although EBVs only shift by a constant (Ismo Strandén and Christensen 2011). This result is problematic because reporting reliabilities becomes tricky. Recently, (Tier, Meyer, and Swan 2018) suggested including the base population as an extra individual, which automatically puts all reliabilities on the same scale.
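A toy sketch computing \(Rel_i\) from the inverse of the mixed model equations (simulated data; the \(u\)-block of the inverse contains the prediction error variances):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 80
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = M.mean(axis=0) / 2
Z = M - 2 * p
G = 0.95 * (Z @ Z.T / (2 * np.sum(p * (1 - p)))) + 0.05 * np.eye(n)

sigma2_u, sigma2_e = 0.3, 0.7
X, W = np.ones((n, 1)), np.eye(n)
y = rng.normal(size=n)
Rinv = np.eye(n) / sigma2_e

# MME in their "first form", with explicit variances
LHS = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ W],
                [W.T @ Rinv @ X, W.T @ Rinv @ W + np.linalg.inv(G) / sigma2_u]])
C = np.linalg.inv(LHS)
C_uu = C[1:, 1:]                       # u-block of the inverse: prediction error (co)variances

rel = 1 - np.diag(C_uu) / (np.diag(G) * sigma2_u)   # Rel_i = 1 - C^ii / (G_ii sigma2_u)
```

Because the prediction error variance can never exceed the genetic variance of the individual, each \(Rel_i\) falls between 0 and 1.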

12.4 All Genomic relationships are equal

We saw before, in Bayesian Regressions, that changing the coding of \(\mathbf{Z}\) to “centered”, 101, 012, or 021 gave the same results. These results apply, in part, to GBLUP and can be summarized as follows. Consider the most frequent form

\[\frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum p_i q_i }\sigma_{u}^{2}\mathbf{= Z}\mathbf{Z}^{\mathbf{'}}\frac{\sigma_{u}^{2}}{2\sum p_i q_i}\mathbf{= G}\sigma_{u}^{2}\]

This matrix \(\mathbf{G}\) is affected by coding and allele frequencies in two places: in the cross-product \(\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\) (coding) and in the divisor \(2\sum p_i q_i\) (allele frequencies).

All ways of centering give the same \(\widehat{\mathbf{u}}\), shifted by a constant, provided that \(\frac{\sigma_{u}^{2}}{2\sum p_i q_i }\) is constant. For instance:

  1. If to construct \(\mathbf{G}\) we use any coding in \(\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\) but we keep \(2\sum p_i q_i\) the same, then \(\widehat{\mathbf{u}}\) will be the same.

  2. If we construct \(\mathbf{G}_{05} = \frac{\mathbf{Z}_{101}\mathbf{Z}_{101}^{\mathbf{'}}}{m/2}\), but then we use as genetic variance \(\frac{\sigma_{u}^{2}}{2\sum p_i q_i}\frac{m}{2}\), then \(\widehat{\mathbf{u}}\) will be the same.

  3. If we estimate genetic variance by REML, then EBVs will be the same but the estimated genetic variance is not necessarily the same.

These properties of G do not hold for SSGBLUP – we will see that later.
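The invariance to coding can be checked numerically. In this sketch (simulated data; an arbitrary per-locus shift of the coding, with the divisor \(2\sum p_i q_i\) kept identical for both codings, and a fixed overall mean in the model), the GEBVs from the two codings differ only by a constant:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 8, 40
Z = rng.integers(0, 3, size=(n, m)).astype(float)   # coding {0,1,2}
shift = rng.normal(size=m)                           # arbitrary per-locus shift of the coding
Zs = Z - np.ones((n, 1)) @ shift[None, :]            # alternative coding of the same genotypes

k = 20.0                                             # same divisor kept for both codings
G_a, G_b = Z @ Z.T / k, Zs @ Zs.T / k                # generically both invertible (n << m)

sigma2_u, sigma2_e = 0.5, 1.0
lam = sigma2_e / sigma2_u
X, y = np.ones((n, 1)), rng.normal(size=n)

def solve_gblup(G):
    # GBLUP MME with an overall mean and one record per animal
    LHS = np.block([[X.T @ X, X.T],
                    [X, np.eye(n) + np.linalg.inv(G) * lam]])
    RHS = np.concatenate([X.T @ y, y])
    sol = np.linalg.solve(LHS, RHS)
    return sol[0], sol[1:]

(mu_a, u_a), (mu_b, u_b) = solve_gblup(G_a), solve_gblup(G_b)
d = u_a - u_b
print(np.ptp(d) < 1e-8)        # True: GEBVs differ only by a constant shift
```

The shift in the GEBVs is absorbed by the estimate of the mean, which moves by exactly the opposite amount.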

12.5 GBLUP with singular G

If \(\mathbf{G}\) is singular, one can use alternative mixed model equations (Harville 1976 ; C. R. Henderson 1984):

\[\begin{pmatrix} \mathbf{X^{'}R^{- 1}X} & \mathbf{X^{'}R^{-1}W} \\ \mathbf{G}\sigma_{u}^{2}\mathbf{W^{'}R^{-1}X} & \mathbf{G}\sigma_{u}^{2}\mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{\ W + I}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{G}\sigma_{u}^{2}\mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

Or a symmetric form that fits better into regular algorithms. First, we predict an auxiliary vector \(\mathbf{\alpha}\):

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{WG}\sigma_{u}^{2} \\ \mathbf{G}\sigma_{u}^{2}\mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{X} & \mathbf{G}\sigma_{u}^{2}\mathbf{W}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{ WG}\sigma_{u}^{2}\mathbf{+ G}\sigma_{u}^{2}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\boldsymbol{\alpha}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{G}\sigma_{u}^{2}\mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

From this, \(\widehat{\mathbf{u}}\mathbf{=}\mathbf{G}\sigma_{u}^{2}\widehat{\boldsymbol{\alpha}}\) .
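A quick numerical check of the equivalence between the standard equations and the symmetric \(\boldsymbol{\alpha}\)-form (here with an invertible toy \(\mathbf{G}\) so that both systems can be solved and compared; the point of the \(\boldsymbol{\alpha}\)-form is that it also works when \(\mathbf{G}\) is singular):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 6
L = rng.normal(size=(n, n))
G = L @ L.T / n + np.eye(n) * 0.1    # a generic (here invertible) toy relationship matrix
sigma2_u, sigma2_e = 0.4, 1.0
Gs = G * sigma2_u                    # G * sigma2_u
X, W = np.ones((n, 1)), np.eye(n)
Rinv = np.eye(n) / sigma2_e
y = rng.normal(size=n)

# standard MME (requires G^-1)
LHS1 = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ W],
                 [W.T @ Rinv @ X, W.T @ Rinv @ W + np.linalg.inv(Gs)]])
sol1 = np.linalg.solve(LHS1, np.concatenate([X.T @ Rinv @ y, W.T @ Rinv @ y]))
u1 = sol1[1:]

# symmetric alpha-form: no inverse of G needed
LHS2 = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ W @ Gs],
                 [Gs @ W.T @ Rinv @ X, Gs @ W.T @ Rinv @ W @ Gs + Gs]])
sol2 = np.linalg.solve(LHS2, np.concatenate([X.T @ Rinv @ y, Gs @ W.T @ Rinv @ y]))
u2 = Gs @ sol2[1:]                   # u-hat = G sigma2_u alpha-hat

print(np.allclose(u1, u2))           # True: both routes give the same u-hat
```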

12.6 Backsolving from GBLUP to marker estimates

Because \(\mathbf{G}\) is formed from marker effects, the algebra warrants that estimates are the same under either GBLUP or BLUP-SNP (P. M. VanRaden 2008), provided that parameterizations are strictly identical (same \(\mathbf{Z}\), same \(p\)’s, same variances, etc). This is up to the numerical error produced by forcing \(\mathbf{G}\) to be invertible; this numerical error is most often negligible. More formal proofs can be found in (C. R. Henderson 1973 ; I. Strandén and Garrick 2009). We present here how to obtain marker effects.

If breeding values \(\mathbf{u}\mathbf{=}\mathbf{Za}\) and \(Var\left( \mathbf{a} \right) = \mathbf{D}\), then the joint distribution of breeding values \(\mathbf{u}\) and marker effects \(\mathbf{a}\) is (C. R. Henderson 1973 ; I. Strandén and Garrick 2009):

\[Var\begin{pmatrix} \mathbf{u} \\ \mathbf{a} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{ZDZ'} & \mathbf{ZD} \\ \mathbf{D}\mathbf{Z}^{'} & \mathbf{D} \\ \end{pmatrix}\]

where, usually, \(\mathbf{D} = \mathbf{I}\sigma_{u}^{2}/2\Sigma p_{i}q_{i}\). Assuming multivariate normality,

\[p\left( \mathbf{u} \middle| \mathbf{a} \right) = N(\mathbf{Za},\mathbf{0})\]

which means that if marker effects are known, then breeding values are exactly known, and their estimate is simply:

\[\widehat{\mathbf{u}}\mathbf{\ |}\widehat{\mathbf{a}}\mathbf{=}E\left( \mathbf{u} \middle| \mathbf{a =}\widehat{\mathbf{a}} \right)\mathbf{=}\mathbf{Z}\widehat{\mathbf{a}}\]

(the breeding value is the sum of marker effects).

However, the opposite is not necessarily true and, conditional on breeding values, marker effects can take several possible values:

\[p\left( \mathbf{a} \middle| \mathbf{u} \right) = N\left( \mathbf{D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{Z}\mathbf{D}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{u,D - D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{- 1}}\mathbf{ZD} \right)\]

Thus, the estimate of marker effects conditional on breeding values is the conditional mean. If \(\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)\mathbf{=}\mathbf{G}\sigma_{u}^{2}\) then:

\[\widehat{\mathbf{a}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}E\left( \mathbf{a} \middle| \mathbf{u =}\widehat{\mathbf{u}} \right)\mathbf{= D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\mathbf{= D\ }\mathbf{Z}^{\mathbf{'}}\mathbf{\ }\mathbf{G}^{- 1}\ \sigma_{u}^{- 2}\mathbf{\ }\widehat{\mathbf{u}}\]

or, if, \(\mathbf{D} = \mathbf{I}\sigma_{u}^{2}/2\Sigma p_{i}q_{i}\):

\[\widehat{\mathbf{a}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}\frac{1}{2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{\ }\mathbf{G}^{- 1}\ \mathbf{\ }\widehat{\mathbf{u}}\]

with associated variance

\[Var\left( \mathbf{a} \middle| \mathbf{u =}\widehat{\mathbf{u}} \right) = \mathbf{D - D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{- 1}}\mathbf{ZD}\]

Note that \(\mathbf{D}\mathbf{-}\mathbf{D}{\mathbf{\ }\mathbf{Z}}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{-}\mathbf{1}}\mathbf{\ }\mathbf{ZD}\) may be positive semidefinite but singular (not invertible), e.g. if two markers are in complete LD. This variance ignores that \(\widehat{\mathbf{u}}\) is an estimate.
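A sketch of the backsolving formula above (simulated genotypes; frequencies treated as known so that \(\mathbf{G}\) stays full rank; \(\widehat{\mathbf{u}}\) is an arbitrary stand-in for GEBVs from a GBLUP run):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 60
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = rng.uniform(0.1, 0.9, size=m)      # "known" frequencies (so G stays full rank)
Z = M - 2 * p
k = 2 * np.sum(p * (1 - p))
G = Z @ Z.T / k                        # generically invertible for n << m

u_hat = rng.normal(size=n)             # GEBVs from some GBLUP run (toy stand-in)
a_hat = (Z.T @ np.linalg.solve(G, u_hat)) / k    # a-hat = (1 / 2 sum pq) Z' G^-1 u-hat

print(np.allclose(Z @ a_hat, u_hat))   # True: marker effects reproduce the GEBVs
```

The check \(\mathbf{Z}\widehat{\mathbf{a}} = \widehat{\mathbf{u}}\) is exact here because \(\mathbf{Z}\mathbf{Z}^{'}/2\Sigma p_i q_i = \mathbf{G}\), so \(\mathbf{Z}\widehat{\mathbf{a}} = \mathbf{G}\mathbf{G}^{-1}\widehat{\mathbf{u}}\).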

12.7 Backsolving when matrix G has been “tuned”

In general, the matrix \(\mathbf{G}\) has undergone some “tuning” (1) to be invertible, (2) to be on the same scale as pedigree relationships. Usually this is:

\[\mathbf{G}^{\text{tuned}}\mathbf{=}\left( 1 - \alpha \right)\left( \mathbf{11}^{\mathbf{'}}a + b\mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)\mathbf{+}\alpha\mathbf{A}_{22}\]

where the coefficient \(a\) adds the extra “average relationship” and at the same time models the difference \(\mu\) between the pedigree base and the genomic base (Z. Vitezica et al. 2011 ; Hsu, Garrick, and Fernando 2017), the coefficient \(b\) considers the reduction in variance due to drift, and \(\alpha\) is the part of genetic variance assigned to (pedigree) residual polygenic effects.

In order to extract marker effects correctly, we need to take this into account. The expression above can be rewritten as:

\[\mathbf{u =}\mathbf{u}_{m} + \mathbf{u}_{p}\]

\[\mathbf{u}_{m} = \mathbf{1}\mu + \mathbf{u}_{m}^{*}\]

where \(\mathbf{u}_{p}\) are “pedigree” BVs, \(\mathbf{u}_{m}\) are “marker” breeding values put (shifted) on a pedigree scale (that of \(\mathbf{A})\) and \(\mathbf{u}_{m}^{*}\) are “marker” breeding values put (shifted) on a genomic scale. The respective variances are:

\[Var\left( \mathbf{u}_{p} \right) = \mathbf{A}_{22}\sigma_{u}^{2}\alpha\]

\[Var\left( \mathbf{u}_{m}^{*} \right) = b\mathbf{ZD}\mathbf{Z}^{\mathbf{'}}\left( 1 - \alpha \right)\]

which implies that, in fact, we have reduced the variance of marker effects from an a priori variance of \(\mathbf{D}\) to another variance of \(Var\left( \mathbf{a} \right) = \ b\left( 1 - \alpha \right)\mathbf{D}\).

Finally,

\[Var\left( \mu \right) = a\left( 1 - \alpha \right)\sigma_{u}^{2}\]

From these elements, we retrieve that \(Var\left( \mathbf{u} \right) = Var\left( \mathbf{u}_{p} + \mathbf{u}_{m}^{*} + \mathbf{1}\mu \right) = \mathbf{G}^{\text{tuned}}\)

From here we can derive the covariance structure:

\[Var\begin{pmatrix} \mu \\ \mathbf{u}_{p} \\ \mathbf{u}_{m}^{*} \\ \mathbf{u} \\ \mathbf{a} \\ \end{pmatrix} = \begin{pmatrix} a\left( 1 - \alpha \right)\sigma_{u}^{2} & 0 & 0 & a\left( 1 - \alpha \right)\mathbf{1'}\sigma_{u}^{2} & 0 \\ 0 & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & 0 & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & 0 \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{ZD}\mathbf{Z}^{\mathbf{'}} & b\left( 1 - \alpha \right)\mathbf{ZD}\mathbf{Z}^{\mathbf{'}} & b\left( 1 - \alpha \right)\mathbf{ZD} \\ a\left( 1 - \alpha \right)\mathbf{1}\sigma_{u}^{2} & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & b\left( 1 - \alpha \right)\mathbf{ZD}\mathbf{Z}^{\mathbf{'}} & \left( 1 - \alpha \right)\left( \mathbf{11}^{\mathbf{'}}a\sigma_{u}^{2} + b\mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)\mathbf{+}\alpha\mathbf{A}_{22}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\mathbf{ZD} \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{D}\mathbf{Z}^{'} & b\left( 1 - \alpha \right)\mathbf{D}\mathbf{Z}^{'} & b\left( 1 - \alpha \right)\mathbf{D} \\ \end{pmatrix}\]

Under the usual assumption \(\mathbf{D} = \mathbf{I}\sigma_{u}^{2}/2\Sigma p_{i}q_{i}\) we put \(\sigma_{u}^{2}\) as common term:

\[Var\begin{pmatrix} \mu \\ \mathbf{u}_{p} \\ \mathbf{u}_{m}^{*} \\ \mathbf{u} \\ \mathbf{a} \\ \end{pmatrix} = \begin{pmatrix} a\left( 1 - \alpha \right)\sigma_{u}^{2} & 0 & 0 & a\left( 1 - \alpha \right)\mathbf{1}^{\mathbf{'}}\sigma_{u}^{2} & 0 \\ 0 & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & 0 & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & 0 \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} \\ a\left( 1 - \alpha \right)\mathbf{1}\sigma_{u}^{2} & \mathbf{A}_{22}\sigma_{u}^{2}\alpha & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} & \left( 1 - \alpha \right)\left( \mathbf{11}^{\mathbf{'}}a\sigma_{u}^{2} + b\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} \right)\mathbf{+}\alpha\mathbf{A}_{22}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}}\sigma_{u}^{2} & b\left( 1 - \alpha \right)\frac{1}{2\Sigma p_{i}q_{i}}\mathbf{I}\sigma_{u}^{2} \\ \end{pmatrix}\]

\[Var\begin{pmatrix} \mu \\ \mathbf{u}_{p} \\ \mathbf{u}_{m}^{*} \\ \mathbf{u} \\ \mathbf{a} \\ \end{pmatrix} = \begin{pmatrix} a\left( 1 - \alpha \right) & 0 & 0 & a\left( 1 - \alpha \right)\mathbf{1'} & 0 \\ 0 & \mathbf{A}_{22}\alpha & 0 & \mathbf{A}_{22}\alpha & 0 \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}} \\ a\left( 1 - \alpha \right)\mathbf{1} & \mathbf{A}_{22}\alpha & b\left( 1 - \alpha \right)\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}} & \left( 1 - \alpha \right)\left( \mathbf{11}^{\mathbf{'}}a + b\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}} \right)\mathbf{+}\alpha\mathbf{A}_{22} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}} \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{I}\frac{1}{2\Sigma p_{i}q_{i}} \\ \end{pmatrix}\sigma_{u}^{2}\]

Let’s call \(\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}}\mathbf{=}\mathbf{G}^{\text{untuned}}\) . Also, \(\mathbf{G}^{\text{tuned}}\mathbf{=}\left( 1 - \alpha \right)\left( \mathbf{11}^{\mathbf{'}}a + b\mathbf{Z}\mathbf{Z}^{\mathbf{'}}\frac{1}{2\Sigma p_{i}q_{i}} \right)\mathbf{+}\alpha\mathbf{A}_{22}\). Thus:

\[Var\begin{pmatrix} \mu \\ \mathbf{u}_{p} \\ \mathbf{u}_{m}^{*} \\ \mathbf{u} \\ \mathbf{a} \\ \end{pmatrix} = \begin{pmatrix} a\left( 1 - \alpha \right) & 0 & 0 & a\left( 1 - \alpha \right)\mathbf{1}^{\mathbf{'}} & 0 \\ 0 & \mathbf{A}_{22}\alpha & 0 & \mathbf{A}_{22}\alpha & 0 \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{G}^{\text{untuned}} & b\left( 1 - \alpha \right)\mathbf{G}^{\text{untuned}} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}} \\ a\left( 1 - \alpha \right)\mathbf{1} & \mathbf{A}_{22}\alpha & b\left( 1 - \alpha \right)\mathbf{G}^{\text{untuned}} & \mathbf{G}^{\text{tuned}} & b\left( 1 - \alpha \right)\mathbf{Z}\frac{1}{2\Sigma p_{i}q_{i}} \\ 0 & 0 & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{Z}^{'}\frac{1}{2\Sigma p_{i}q_{i}} & b\left( 1 - \alpha \right)\mathbf{I}\frac{1}{2\Sigma p_{i}q_{i}} \\ \end{pmatrix}\sigma_{u}^{2}\]

And from here, we can get the equations for backsolving (the factor \(\sigma_{u}^{2}\) cancels out):

\[\widehat{\mu}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}E\left( \mu \middle| \mathbf{u =}\widehat{\mathbf{u}} \right)\mathbf{=}a\left( 1 - \alpha \right)\mathbf{1'}{\mathbf{G}^{\mathbf{\text{tuned}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[{\widehat{\mathbf{u}}}_{p}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}\mathbf{A}_{\mathbf{22}}\alpha{\mathbf{G}^{\mathbf{\text{tuned}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[{\widehat{\mathbf{u}}}_{m}^{\mathbf{*}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}b\left( 1 - \alpha \right)\mathbf{G}^{\mathbf{\text{untuned}}}{\mathbf{G}^{\mathbf{\text{tuned}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[\widehat{\mathbf{a}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}b\left( 1 - \alpha \right)\mathbf{Z'}\frac{1}{2\Sigma p_{i}q_{i}}{\mathbf{G}^{\mathbf{\text{tuned}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[{\widehat{\mathbf{u}}}_{m}^{\mathbf{*}}\mathbf{|}\widehat{\mathbf{a}}\mathbf{= Z}\widehat{\mathbf{a}}\]

which upon substituting \(\widehat{\mathbf{a}}\) is strictly identical to \({\widehat{\mathbf{u}}}_{m}^{\mathbf{*}}\) above.

\[{\widehat{\mathbf{u}}}_{m} = \widehat{\mu} + {\widehat{\mathbf{u}}}_{m}^{\mathbf{*}}\]
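The decomposition above can be verified numerically: the backsolved components \(\widehat{\mu}\), \({\widehat{\mathbf{u}}}_{p}\) and \({\widehat{\mathbf{u}}}_{m}^{*}\) must add up to \(\widehat{\mathbf{u}}\) exactly, because their coefficient matrices sum to \(\mathbf{G}^{\text{tuned}}\). A toy sketch (invented tuning coefficients and pedigree relationships; \(\widehat{\mathbf{u}}\) is a stand-in for GEBVs):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 8, 60
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = rng.uniform(0.1, 0.9, size=m)
Z = M - 2 * p
k = 2 * np.sum(p * (1 - p))
G_unt = Z @ Z.T / k                               # "untuned" G

a_coef, b_coef, alpha = 0.02, 0.95, 0.05          # toy tuning coefficients a, b, alpha
A22 = np.eye(n) + 0.1                             # toy pedigree relationships (I + 0.1*11')
G_tun = (1 - alpha) * (a_coef * np.ones((n, n)) + b_coef * G_unt) + alpha * A22

u_hat = rng.normal(size=n)                        # stand-in GEBVs
Ginv_u = np.linalg.solve(G_tun, u_hat)            # G_tuned^-1 u-hat

mu_hat   = a_coef * (1 - alpha) * np.sum(Ginv_u)  # a(1-alpha) 1' G_tuned^-1 u-hat
u_p      = alpha * A22 @ Ginv_u
u_m_star = b_coef * (1 - alpha) * G_unt @ Ginv_u
a_hat    = b_coef * (1 - alpha) * (Z.T @ Ginv_u) / k

print(np.allclose(mu_hat + u_p + u_m_star, u_hat))  # True: components sum to u-hat
print(np.allclose(Z @ a_hat, u_m_star))             # True: u_m* = Z a-hat
```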

12.8 Backsolving with metafounders

When metafounders are used, there are some differences:

  1. matrix \(\mathbf{G}\) is not “tuned” so there is no implicit \(\mu\)
  2. allele frequencies to construct \(\mathbf{G}\) are assumed to be all \(p_i=0.5\)
  3. Pedigree relationships are \(\mathbf{A}_{\Gamma 22}\)

Thus the prior covariance of marker effects \(\mathbf{D}\) is \(\mathbf{D}=\mathbf{I}\frac{1}{2\sum{p_i q_i}}=\mathbf{I}\frac{2}{m}\) with \(m\) the number of markers and \(\mathbf{G}=\mathbf{ZD}\mathbf{Z}'=\frac{2}{m}\mathbf{ZZ'}\) with \(\mathbf{Z}\) coded as {-1,0,1} for the three genotypes.

In fact, the role of the difference between the genomic base and the pedigree base is played by the metafounder solution, which gives the difference between the genetic level of an ideal genotyped population with \(p=0.5\) and that of the base population animals represented by the pedigree.

However, there may still be some “blending” as

\[\mathbf{G}^{\text{blended}}=\left( 1 - \alpha \right)\mathbf{G}+\alpha\mathbf{A}_{\Gamma 22}=\left( 1 - \alpha \right)\left( \frac{2}{m}\mathbf{ZZ'} \right)\mathbf{+}\alpha\mathbf{A}_{\Gamma 22}\]

and the equations are like those above but with \(a=0\), \(b=1\), and there is no difference between pedigree and genomic bases:

\[{\widehat{\mathbf{u}}}_{p}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}\alpha\mathbf{A}_{\Gamma 22}{\mathbf{G}^{\mathbf{\text{blended}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[{\widehat{\mathbf{u}}}_{m}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}\left( 1 - \alpha \right)\left( \frac{2}{m}\mathbf{ZZ'} \right){\mathbf{G}^{\mathbf{\text{blended}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

Note that if you sum \(\mathbf{\hat u}_{p}\) and \(\mathbf{\hat u}_{m}\) you get \(\mathbf{\hat u}\).

\[\widehat{\mathbf{a}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{=}\left( 1 - \alpha \right)\mathbf{Z'}\frac{2}{m}{\mathbf{G}^{\mathbf{\text{blended}}}}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

\[{\widehat{\mathbf{u}}}_{m}\mathbf{|}\widehat{\mathbf{a}}\mathbf{= Z}\widehat{\mathbf{a}}\]
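A toy check of the metafounder backsolving (simulated {-1,0,1} genotypes; an invented \(\mathbf{A}_{\Gamma 22}\)): the components \({\widehat{\mathbf{u}}}_{p}\) and \({\widehat{\mathbf{u}}}_{m}\) sum to \(\widehat{\mathbf{u}}\), and \(\mathbf{Z}_{05}\widehat{\mathbf{a}}\) reproduces \({\widehat{\mathbf{u}}}_{m}\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 8, 60
Z05 = rng.integers(-1, 2, size=(n, m)).astype(float)  # coding {-1,0,1}, p = 0.5
G = (2.0 / m) * Z05 @ Z05.T
alpha = 0.05
A_g22 = np.eye(n) + 0.2                               # toy A_Gamma22 among genotyped animals
G_bl = (1 - alpha) * G + alpha * A_g22                # "blended" G (no tuning needed)

u_hat = rng.normal(size=n)                            # stand-in GEBVs
Ginv_u = np.linalg.solve(G_bl, u_hat)

u_p   = alpha * A_g22 @ Ginv_u
u_m   = (1 - alpha) * G @ Ginv_u
a_hat = (1 - alpha) * (2.0 / m) * Z05.T @ Ginv_u

print(np.allclose(u_p + u_m, u_hat))   # True
print(np.allclose(Z05 @ a_hat, u_m))   # True
```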

12.9 Indirect predictions using marker effects

Imagine that we need to make indirect predictions (e.g. once a week) and we want to use estimates of SNP effects without going through the whole process of running GBLUP (or ssGBLUP). We genotype the new animals and obtain a matrix of genotypes \(\mathbf{Z}_{new}\). This matrix needs to be centered with the same allele frequencies as the original GBLUP or ssGBLUP.

For indirect predictions of newborn animals, there are two parts: \(\mathbf{u}\mathbf{=}\mathbf{u}_{m} + \mathbf{u}_{p}\). The first part is obtained as a sum of marker effects, plus the difference between genomic and pedigree bases:

\[{\widehat{\mathbf{u}}}_{m} = \widehat{\mu} + {\widehat{\mathbf{u}}}_{m}^{\mathbf{*}}\mathbf{=}\widehat{\mu} + \mathbf{Z}_{new}\widehat{\mathbf{a}}\]

whereas the pedigree part of indirect predictions has to be obtained as the parent average of the parents’ pedigree parts (not of their complete breeding values). Note that this parent average only accounts for a small part \(\alpha\) of the genetic variance. That is, for individual \(i\):

\[{\widehat{u}}_{p,i} = 0.5\left( {\widehat{u}}_{p,sire(i)} + {\widehat{u}}_{p,dam(i)} \right)\]

If the animal has no records, indirect predictions are the same as GBLUP predictions. In SSGBLUP (which we have not described yet), they are the same if both parents are genotyped, and almost the same if not. The reason is that the \(\mathbf{H}\) matrix is slightly different depending on whether the genotyped animal is included in the SSGBLUP or not.
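A minimal sketch of an indirect prediction (all inputs invented: backsolved marker effects, base difference \(\widehat{\mu}\), and the parents’ pedigree parts; \(\mathbf{z}_{new}\) must be centered with the original frequencies):

```python
import numpy as np

# toy pieces assumed available from a previous (ss)GBLUP run and its backsolving
m = 60
rng = np.random.default_rng(8)
a_hat = rng.normal(scale=0.05, size=m)   # backsolved marker effects (invented)
mu_hat = 0.12                            # estimated genomic-minus-pedigree base difference
p = rng.uniform(0.1, 0.9, size=m)        # the SAME allele frequencies as the evaluation

# newborn animal: genotype centered with the ORIGINAL frequencies
M_new = rng.integers(0, 3, size=m).astype(float)
z_new = M_new - 2 * p
u_m_new = mu_hat + z_new @ a_hat         # marker part of the indirect prediction

# pedigree part: parent average of the parents' pedigree parts (a small alpha share)
u_p_sire, u_p_dam = 0.03, -0.01          # invented parental pedigree parts
u_p_new = 0.5 * (u_p_sire + u_p_dam)

u_new = u_m_new + u_p_new                # indirect prediction of the total breeding value
```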

12.10 Reliabilities of indirect predictions with “tuned G”

  1. Note: this section will need reordering
  2. In this section all matrices G are what we call “tuned” unless otherwise stated

We may consider the reliability of \(u_m\) (breeding value referred to the base of the pedigree) or the reliability of \(u_m^{*}\) (breeding value referred to the genomic population). These two accuracies are for a single individual.

The difference between the two is the difference between genomic and pedigree bases, \(\mu\), which, although not explicitly computed in ssGBLUP, is there. In fact, this \(\mu\) is estimated with some uncertainty.

For instance, for a single individual we have that \[ Var(u_m^*)=b(1-\alpha)G_{ii}^{untuned}\sigma^2_u \]

\[ Var(u_m)=Var(\mu+u_m^{*})= a(1-\alpha)\sigma^2_u +b(1-\alpha)G_{ii}^{untuned}\sigma^2_u=(1-\alpha)G_{ii}\sigma^2_u \]

where, for the purposes of indirect prediction, it may be easier to build it as (for individual \(j\)) \(G_{jj}^{untuned}=\frac{\mathbf{z}_j^{'}\mathbf{z}_j}{2\sum{p_i q_i}}\).

Then we need the \(Var(\hat{u}_m^{*})\). This can be obtained as a function of \(\hat{\mathbf{u}}^*_m = \mathbf{Z}_{new}\widehat{\mathbf{a}}\) such that

\[ Var(\mathbf{\hat{u}}^*_m)=\mathbf{Z}_{new} Var(\widehat{\mathbf{a}}) \mathbf{Z}^{'}_{new} \]

(note that \(\mathbf{Z}\) is the matrix for animals in the (SS)GBLUP evaluation, whereas \(\mathbf{Z}_{new}\) is for animals in indirect prediction) with

\[Var(\widehat{\mathbf{a}}) = (1-\alpha)b\frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{Z}\frac{1}{\ 2\Sigma p_{i}q_{i}}(1-\alpha)b\]

which gives the alternative (and not necessarily better) expression

\[Var(\mathbf{\hat{u}}^*_m) = (1-\alpha)b\frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}_{new}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{Z}\mathbf{Z}_{new}^{'}\frac{1}{\ 2\Sigma p_{i}q_{i}}(1-\alpha)b=\]

\[= (1-\alpha)b\mathbf{G}_{new,old}^{untuned}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{G}_{old,new}^{untuned}(1-\alpha)b \]

for \(\mathbf{G}_{new,old}^{untuned}=\frac{\mathbf{Z}_{new} \mathbf{Z}^{'}}{2\Sigma p_{i}q_{i}}\) which makes sense as the variance of a selection index.

A bit trickier is to obtain the scalar \(Var(\hat{u}_m)=Var(\hat{\mu}+\hat{u}_m^{*})\)

\[Var(\hat{u}_m)=Var(\hat{\mu}+\hat{u}_m^{*})= Var(\hat{\mu})+Var(\hat{u}_m^{*})+2Cov(\hat{\mu},\hat{u}_m^{*}) \]

or, in matrix form,

\[Var(\mathbf{\hat{u}_m})=Var(\mathbf{1}\hat{\mu}+\mathbf{\hat{u}}_m^{*})=\]

\[=\mathbf{11'}Var(\hat{\mu})+\mathbf{1}Cov(\hat{\mu},\mathbf{\hat{u}}_m^{'*})+Cov(\mathbf{\hat{u}}_m^{*},\hat{\mu})\mathbf{1}^{'}+Var(\mathbf{\hat{u}}^*_m) \]

We already obtained \(Var(\mathbf{\hat{u}}^*_m)\) above. The term \(Var(\hat{\mu})\) is as follows.

\[Var(\hat{\mu})=a(1-\alpha)\mathbf{1'G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu})\mathbf{G}^{-1}\mathbf{1}a(1-\alpha)\]

which is equal to \(a^2 (1-\alpha)^2\) times the sum of elements of \(\mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu})\mathbf{G}^{-1}\). This is not difficult.

Then we need \(Cov(\hat{\mu},\mathbf{\hat{u}}_m^{*'})\). This is a row vector (which multiplied by \(\mathbf{1}\) gives a matrix) whose interpretation is “how much the uncertainty in \(\mu\) affects each animal” (i.e. because its genomic information is poor). Anyway, this is

\[ \begin{aligned} Cov(\hat{\mu},\mathbf{\hat{u}}_m^{*'}) & =a(1-\alpha)\mathbf{1'G}^{-1}Var(\widehat{\mathbf{u}})\mathbf{Z}\mathbf{Z}_{new}^{'} \frac{1}{2\Sigma p_i q_i}b(1-\alpha) \\ & =a(1-\alpha)\mathbf{1'G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu})\mathbf{Z}\mathbf{Z}_{new}^{'} \frac{1}{2\Sigma p_i q_i}b(1-\alpha) \end{aligned} \]

which is rather cumbersome and where \(\mathbf{G}_{old,new}^{untuned}=\frac{\mathbf{Z}\mathbf{Z}_{new}^{'}}{2\Sigma p_{i}q_{i}}\) appears, which describes how close new animals are to old ones.

It is easier to work with the joint distribution of \(\hat{\mu}, \mathbf{\hat{a}}\) as follows:

\[ \begin{aligned} Var \begin{pmatrix} \hat{\mu} \\ \mathbf{\hat{a}} \end{pmatrix} & = \begin{bmatrix} a(1-\alpha)\mathbf{1}' \\ b(1-\alpha)\frac{1}{2\sum{p_i q_i }}\mathbf{Z}' \end{bmatrix} \mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu}) \mathbf{G}^{-1}\begin{bmatrix} a(1-\alpha)\mathbf{1} & b(1-\alpha)\frac{1}{2\sum{p_i q_i }}\mathbf{Z} \end{bmatrix} \\ \end{aligned} \]

which expanded gives:

\[ \begin{aligned} Var \begin{pmatrix} \hat{\mu} \\ \mathbf{\hat{a}} \end{pmatrix} & = \begin{pmatrix} a(1-\alpha)\mathbf{1}' \mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu}) \mathbf{G}^{-1} \mathbf{1} (1-\alpha)a & a(1-\alpha)\mathbf{1}' \mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu}) \mathbf{G}^{-1} \mathbf{Z} \frac{1}{2\sum{p_i q_i }}(1-\alpha)b \\ b(1-\alpha)\frac{1}{2\sum{p_i q_i }}\mathbf{Z}' \mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu}) \mathbf{G}^{-1} \mathbf{1} (1-\alpha)a & b(1-\alpha)\frac{1}{2\sum{p_i q_i }}\mathbf{Z}' \mathbf{G}^{-1}(\mathbf{G}\sigma_{u}^{2}-\mathbf{C}^{uu}) \mathbf{G}^{-1} \mathbf{Z} \frac{1}{2\sum{p_i q_i }}(1-\alpha)b \end{pmatrix} \end{aligned} \]

Consider now a single individual with row vector of genotypes \(\mathbf{z}_{new}\). Its total breeding value through indirect prediction (including \(\mu\)) is \(\hat{u}_m=\begin{pmatrix} 1 & \mathbf{z}_{new} \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \mathbf{\hat{a}} \end{pmatrix}\) and thus the final expression for PEV of the indirect prediction, using \(Var \begin{pmatrix} \hat{\mu} \\ \mathbf{\hat{a}} \end{pmatrix}\) above, is:

\[Var(\hat{u}_m)= \begin{pmatrix} 1 & \mathbf{z}_{new} \end{pmatrix} Var \begin{pmatrix} \hat{\mu} \\ \mathbf{\hat{a}} \end{pmatrix} \begin{pmatrix} 1 \\ \mathbf{z}'_{new} \end{pmatrix}\]

and the final expression for Reliability is

\[Rel_{base-pedigree}=\frac{Var(\hat{u}_m)}{Var(u_m)}\]

with \(Var(\hat{u}_m)\) as above and \(Var(u_m)=(1-\alpha)G_{new,new}\sigma^2_u=(1-\alpha)(a+b \frac{1}{2\sum{p_i q_i}} \mathbf{z}_{new} \mathbf{z}'_{new} )\sigma^2_u\).
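A toy sketch of the whole computation (simulated genotypes; \(\mathbf{C}^{uu}\) replaced by an invented stand-in proportional to \(\mathbf{G}\sigma^2_u\), which a real run would take from the inverse of the MME):

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 10, 80
M = rng.integers(0, 3, size=(n, m)).astype(float)
p = rng.uniform(0.1, 0.9, size=m)
Z = M - 2 * p
k = 2 * np.sum(p * (1 - p))
a_c, b_c, alpha = 0.02, 0.95, 0.05                     # toy tuning coefficients a, b, alpha
A22 = np.eye(n) + 0.1                                  # toy pedigree relationships
G = (1 - alpha) * (a_c * np.ones((n, n)) + b_c * Z @ Z.T / k) + alpha * A22  # "tuned" G

sigma2_u = 0.4
C_uu = 0.3 * G * sigma2_u                              # invented stand-in for the MME inverse block
K = np.linalg.solve(G, np.linalg.solve(G, G * sigma2_u - C_uu).T).T  # G^-1 (G s2u - Cuu) G^-1

# joint Var(mu-hat, a-hat) as a sandwich L K L'
L = np.vstack([a_c * (1 - alpha) * np.ones((1, n)),
               b_c * (1 - alpha) * Z.T / k])
V = L @ K @ L.T                                        # (1+m) x (1+m)

# one new animal, centered with the same frequencies
z_new = rng.integers(0, 3, size=m).astype(float) - 2 * p
w = np.concatenate([[1.0], z_new])
var_u_m_hat = w @ V @ w                                # Var(mu-hat + z_new a-hat)
var_u_m = (1 - alpha) * (a_c + b_c * z_new @ z_new / k) * sigma2_u
rel = var_u_m_hat / var_u_m                            # reliability w.r.t. the pedigree base
```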

12.11 Reliabilities of indirect predictions with metafounders

Note: here \(\mathbf{G}\) is \(\mathbf{G}^{blended}\) (but not “tuned” because in metafounders tuning is not needed).

The work of (Bermann et al. 2023) proposed a coherent framework to define reliabilities when there are several base populations (metafounders), as contrasts from a particular reference metafounder (\(mf\)). We try to stick to their notation, but to indicate that we build \(\mathbf{G}\) with 0.5 allele frequencies we subscript \(\mathbf{G}\), \(\mathbf{H}\) and \(\mathbf{Z}\) with \(05\); the genetic variance rescaled for metafounders is \(\sigma^2_{u,mf}\). Here we derive the reliability of the indirect prediction with metafounders, expressed as the reliability of a contrast. Let \(\hat{u}_m=\mathbf{z}_{05new} \hat{\mathbf{a}}\), contrasted with the reference metafounder, i.e. \(u_m - u_{mf}\). We will call this contrast \(u_c=u_m - u_{mf}\). Reliability is then the squared correlation between the true \(u_c\) and the estimated \(\hat{u}_c=\hat{u}_m - \hat{u}_{mf}\). Then

\[Rel_c = \frac{Var(\hat{u}_c)}{Var(u_c)}\]

To obtain \(Var(\hat{u}_c)\) we express

\[ \hat{u}_c=\begin{pmatrix} -1 & \mathbf{z}_{05new} \end{pmatrix} \begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{a}} \end{pmatrix} \]

First we need to consider the block of the reference metafounder plus genotyped individuals \(\mathbf{u}_g\) :

\[ Var \begin{pmatrix} \hat{u}_{mf} \\ \mathbf{\hat{u}}_{g} \end{pmatrix} = \mathbf{H}_{05block}\sigma^2_{u,mf} - \mathbf{C^{block}} \]

with \(\mathbf{H}_{05block}= \begin{pmatrix} h_{05,mf,mf} & \mathbf{h}_{05,mf,g} \\ \mathbf{h}_{05,g,mf} & \mathbf{G}_{05} \end{pmatrix}\) (because \(\mathbf{H}_{05,g,g} =\mathbf{G}_{05}\)) and \(\mathbf{C^{block}}= \begin{pmatrix} {c}^{22}_{mf,mf} & \mathbf{c}^{22}_{mf,g} \\ \mathbf{c}^{22}_{g,mf} & \mathbf{C}^{22}_{g,g} \end{pmatrix}\)

Note that

\[ \begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{a}} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & (1-\alpha)\frac{2}{m}\mathbf{Z}_{05} \mathbf{G}^{-1}_{05} \end{pmatrix} \begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{u}}_g \end{pmatrix} \]

from which

\[ \begin{aligned} Var\begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{a}} \end{pmatrix} & = \begin{pmatrix} 1 & 0 \\ 0 & (1-\alpha)\frac{2}{m}\mathbf{Z}_{05} \mathbf{G}^{-1}_{05} \end{pmatrix} \left[ \begin{pmatrix} h_{05,mf,mf}\sigma^2_{u,mf} & \mathbf{h}_{05,mf,g}\sigma^2_{u,mf} \\ \mathbf{h}_{05,g,mf}\sigma^2_{u,mf} & \mathbf{G}_{05}\sigma^2_{u,mf} \end{pmatrix} - \begin{pmatrix} {c}^{22}_{mf,mf} & \mathbf{c}^{22}_{mf,g} \\ \mathbf{c}^{22}_{g,mf} & \mathbf{C}^{22}_{g,g} \end{pmatrix} \right] \begin{pmatrix} 1 & 0 \\ 0 & \mathbf{G}^{-1}_{05} \mathbf{Z}'_{05} \frac{2}{m} (1-\alpha) \end{pmatrix} \\ &= \begin{bmatrix} (h_{05,mf,mf}\sigma^2_{u,mf} - {c}^{22}_{mf,mf}) & (\mathbf{h}_{05,mf,g}\sigma^2_{u,mf} - \mathbf{c}^{22}_{mf,g})\mathbf{G}^{-1}_{05} \mathbf{Z}'_{05} \frac{2}{m} (1-\alpha) \\ (1-\alpha)\frac{2}{m}\mathbf{Z}_{05} \mathbf{G}^{-1}_{05} (\mathbf{h}_{05,g,mf}\sigma^2_{u,mf} - \mathbf{c}^{22}_{g,mf}) & (1-\alpha)\frac{2}{m}\mathbf{Z}_{05} \mathbf{G}_{05}^{-1}(\mathbf{G}_{05}\sigma^2_{u,mf} - \mathbf{C}^{22}_{g,g})\mathbf{G}^{-1}_{05} \mathbf{Z}'_{05} \frac{2}{m} (1-\alpha) \end{bmatrix} \end{aligned} \]

Finally, we create the quadratic form inserting the above expression:

\[ Var(\hat{u}_c)= \begin{pmatrix} -1 & \mathbf{z}_{05new} \end{pmatrix} Var\begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{a}} \end{pmatrix} \begin{pmatrix} -1 \\ \mathbf{z}'_{05new} \end{pmatrix} \]

The next step is deriving \(Var(u_c)\). This would require building the \(H_\Gamma\) matrix including all previous individuals plus the new one, to obtain

\[Var(u_c)=(H_{\Gamma , mf,mf}-2H_{\Gamma , mf,new}+H_{\Gamma , new,new})\sigma^2_{u,mf}\]

The element \(H_{\Gamma , mf,mf}\) can be stored from the run. The element \(H_{\Gamma , new,new}=\frac{2}{m}\mathbf{z}_{new} \mathbf{z}'_{new}\). As for \(H_{\Gamma , mf,new}\), this element seems hard to obtain because it involves in principle all genotyped animals plus the new one.

Another practical solution is to work not with \(H\) but with \(A\) assuming the following:

\[Var(u_c) \approx (A_{\Gamma , mf,mf}-2A_{\Gamma , mf,new}+A_{\Gamma , new,new})\sigma^2_{u,mf}\]

where \(A_{\Gamma , new,new}=1+F_{\Gamma,new}\) can be obtained with an inbreeding algorithm, \(A_{\Gamma , mf,mf}=\Gamma_{mf,mf}\) and \(A_{\Gamma , mf,new}= \mathbf{q}_{new} \Gamma_{:,mf}\) i.e. a vector of metafounders proportions in \(new\), times the \(mf\) column in \(\Gamma\).

The next possibility is probably better and simpler.

12.11.1 Reliability with assumed allele frequencies for the reference metafounder

Another possibility is to use \(u_{mf} = (2\mathbf{p}_{mf} -\mathbf{1}) \mathbf{a}\) and derive both \(Var(\hat{u}_c)\) and \(Var(u_c)\). This assumes (perfect) knowledge of \(\mathbf{p}_{mf}\), row vector of allele frequencies, which may be estimated by Gengler’s method (equivalently, GLS) or another method. Note that if allele frequencies of the metafounder were perfectly known, then \(H_{\Gamma , mf,mf}=\frac{2}{m}(2\mathbf{p}_{mf} -\mathbf{1})(2\mathbf{p}_{mf} -\mathbf{1})'\) and \(H_{\Gamma , mf,new}=\frac{2}{m}(2\mathbf{p}_{mf} -\mathbf{1})\mathbf{z}'_{05new}\) .

Then:

\[ \begin{aligned} \hat{u}_c & = \begin{pmatrix} -1 & \mathbf{z}_{05new} \end{pmatrix} \begin{pmatrix} \hat{u}_{mf} \\ \hat{\mathbf{a}} \end{pmatrix} = \begin{pmatrix} -1 & \mathbf{z}_{05new} \end{pmatrix} \begin{pmatrix} (2\mathbf{p}_{mf} -\mathbf{1}) \mathbf{\hat{a}} \\ \hat{\mathbf{a}} \end{pmatrix} \\ & = \left( \mathbf{z}_{05new} - (2\mathbf{p}_{mf} -\mathbf{1}) \right) \hat{\mathbf{a}} = (\mathbf{m}_{new}-2\mathbf{p}) \hat{\mathbf{a}} \end{aligned} \]

with \(\mathbf{m}_{new}\) coded as \(\{0,1,2\}\) and \(\mathbf{p}=\mathbf{p}_{mf}\). In fact, \((\mathbf{m}_{new}-2\mathbf{p})\) is simply the original (P. M. VanRaden 2008) coding, with associated

\[Var(\hat{u}_c)= (\mathbf{m}_{new}-2\mathbf{p}) Var(\hat{\mathbf{a}}) (\mathbf{m}_{new}-2\mathbf{p})'\]

and in the denominator we have

\[ Var(u_c) = \frac{2}{m}(\mathbf{m}_{new}-2\mathbf{p})(\mathbf{m}_{new}-2\mathbf{p})' \sigma^2_{u,mf} \]

which is the relationship of the new individual times the genetic variance. It can probably be shown that, for a single metafounder and base allele frequencies \(\mathbf{p}\), this is identical to the regular expression involving no metafounders.

The final expression is

\[Rel_c = \frac{Var(\hat{u}_c)}{Var(u_c)}=\frac{(\mathbf{m}_{new}-2\mathbf{p}) Var(\hat{\mathbf{a}}) (\mathbf{m}_{new}-2\mathbf{p})'}{\frac{2}{m}(\mathbf{m}_{new}-2\mathbf{p})(\mathbf{m}_{new}-2\mathbf{p})' \sigma^2_{u,mf}}\]

where \(Var(\hat{\mathbf{a}})\) was shown above.

If, instead of the reference metafounder allele frequencies \(\mathbf{p}_{mf}\), we use “current” allele frequencies, the reliability obtained will refer to the current population.
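The final expression can be sketched in a few lines of numpy. All numbers below are placeholders: the allele frequencies, the genotype of the new animal, \(Var(\hat{\mathbf{a}})\) and \(\sigma^2_{u,mf}\) would in practice come from the (ss)GBLUP run.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5                            # number of markers (toy size)
p = rng.uniform(0.1, 0.9, m)     # assumed allele frequencies of the reference metafounder
m_new = rng.integers(0, 3, m)    # genotype of the new animal, coded 0/1/2

z = m_new - 2 * p                # (m_new - 2p): original VanRaden (2008) coding
Var_a_hat = 0.01 * np.eye(m)     # Var(a_hat): placeholder for the value from the run
sigma2_u_mf = 1.0                # genetic variance on the metafounder scale (placeholder)

var_uc_hat = z @ Var_a_hat @ z               # numerator Var(u_c_hat)
var_uc = (2 / m) * (z @ z) * sigma2_u_mf     # denominator Var(u_c)
rel_c = var_uc_hat / var_uc
```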

12.12 Bayesian distribution of marker effects from GBLUP

Imagine now that, from GBLUP (or G-Gibbs…), we obtain the posterior distribution of \(\mathbf{u}\), i.e. from inversion of the Mixed Model Equations or from Monte Carlo, as:

\[p\left( \mathbf{u} \middle| \mathbf{y} \right) = N\left( \widehat{\mathbf{u}},\mathbf{C}^{\text{uu}} \right)\]

where \(Var\left( \mathbf{u}\mathbf{|}\mathbf{y} \right) = \mathbf{C}^{\text{uu}}\) is the posterior covariance matrix (I am using Hendersonian notation here). To derive the posterior distribution of marker effects, we multiply the conditional distribution \(p\left( \mathbf{a} \middle| \mathbf{u} \right)\) by the posterior distribution \(p\left( \mathbf{u} \middle| \mathbf{y} \right)\). This has two parts: first, to account for the uncertainty in \(\widehat{\mathbf{u}}\) contained in \(\mathbf{C}^{\text{uu}}\) as:

\[Var\left( \widehat{\mathbf{a}}\mathbf{|y} \right) = Var\left( \widehat{\mathbf{a}} \middle| Var\left( \mathbf{u} \middle| \mathbf{y} \right) \right)\]

\[=Var\left( \mathbf{D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{\ }(\mathbf{u}-\widehat{\mathbf{u}}) \right)\mathbf{= D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{C}^{\text{uu}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{ZD}\]

and second, for the remaining noise

\[Var(\mathbf{a -}\widehat{\mathbf{a}}\mathbf{|y) = D - D}\mathbf{ Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{- 1}}\mathbf{ZD}\]

which gives

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \mathbf{D - D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{- 1}}\mathbf{\ ZD + D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{C}^{\text{uu}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{ZD}\]

which is a Bayesian distribution like the one obtained, e.g. using Gibbs Sampling as in SNP-BLUP.

Putting \(\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)\mathbf{=}\mathbf{G}\sigma_{u}^{2}\) and reordering yields

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \mathbf{D + D}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}\sigma_{u}^{- 2}(\mathbf{C}^{\text{uu}} - \mathbf{G}\sigma_{u}^{2})\mathbf{G}^{\mathbf{- 1}}\sigma_{u}^{- 2}\mathbf{ZD}\]

Perhaps it is more enlightening to consider the alternative expression

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \mathbf{D - D}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}\sigma_{u}^{- 2}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\sigma_{u}^{- 2}\mathbf{ZD}\]

which is composed of two terms: first, the a priori variance of marker effects, \(\mathbf{D}\); second, an a posteriori reduction, given by the data, in their uncertainty. This reduction comes from a reduction in the uncertainty of the genomic breeding values \((\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\), which is in turn transferred to the marker effects via the linear operator \(\mathbf{D}\mathbf{Z}^{\mathbf{'}}\).

If \(\mathbf{C}^{\text{uu}} \approx \mathbf{0}\) (well known animals such as progeny tested bulls) this yields \(Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \mathbf{D}\mathbf{-}\mathbf{D}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{-}\mathbf{1}}\sigma_{u}^{-2}\mathbf{ZD}\).

If \(\mathbf{D} = \mathbf{I}\sigma_{u}^{2}/2\Sigma p_{i}q_{i}\) (the usual assumption where, as discussed in previous sections, \(\mathbf{ZD}\mathbf{Z}^{\mathbf{'}}\mathbf{=}\mathbf{G}\sigma_{u}^{2}\)), the expressions above simplify. The estimate of the marker effects is

\[\mathbf{E}\left( \mathbf{a|y} \right)\mathbf{=}\widehat{\mathbf{a}}\mathbf{|}\widehat{\mathbf{u}}\mathbf{= D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\mathbf{=}\frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{\ }\mathbf{G}^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\]

with covariance matrix

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right) = \frac{\sigma_{u}^{2}}{\ 2\Sigma p_{i}q_{i}}\mathbf{I -}\frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{Z}\frac{1}{\ 2\Sigma p_{i}q_{i}}\]

Thus, the full distribution of marker effects can be deduced from breeding values by backsolving, using the genomic relationship matrix and the markers’ incidence matrix.

12.13 Frequentist distribution of marker effects from GBLUP

This is described in (Gualdrón Duarte et al. 2014). The distribution of interest is \(Var\left( \widehat{\mathbf{a}} \right)\), the frequentist variance of the estimators, integrating over the conceptual distribution of all possible \(\mathbf{y}\)’s. Using results in Henderson (Charles R. Henderson 1975; C. R. Henderson 1984) we get, similar (but not identical) to the above:

\[Var(\widehat{\mathbf{a}}) = \mathbf{D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\left( {\mathbf{ZD}\mathbf{Z}^{\mathbf{'}}\mathbf{-}\mathbf{C}}^{\text{uu}} \right)\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{Z}\mathbf{D}\]

or

\[Var(\widehat{\mathbf{a}}) = \frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{Z}\frac{1}{\ 2\Sigma p_{i}q_{i}}\]

The difference is the term \(\mathbf{D}\) because instead of computing \(Var\left( \mathbf{a} \middle| \mathbf{y} \right)\) they compute \(Var\left( \widehat{\mathbf{a}} \right)\). In fact, \(Var\left( \widehat{\mathbf{a}} \right) = Var\left( \mathbf{a} \right) - Var\left( \mathbf{a} \middle| \mathbf{y} \right)\mathbf{.}\)

When matrix \(G\) has been “tuned”, the expression is (Ignacio Aguilar et al. 2019):

\[Var(\widehat{\mathbf{a}}) = (1-\alpha)b\frac{1}{\ 2\Sigma p_{i}q_{i}}\mathbf{Z}^{\mathbf{'}}\mathbf{G}^{\mathbf{- 1}}(\mathbf{G}\sigma_{u}^{2} - \mathbf{C}^{\text{uu}})\mathbf{G}^{\mathbf{- 1}}\mathbf{Z}\frac{1}{\ 2\Sigma p_{i}q_{i}}(1-\alpha)b\]

12.13.1 Example of marker predictions from GBLUP

Let there be two individuals and three markers and \(\sigma_{u}^{2} = 1\):

\(\mathbf{Z}\mathbf{=}\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ \end{pmatrix}\) and \(\mathbf{D}\mathbf{=}\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{pmatrix}\); both \(\mathbf{ZD}\mathbf{Z}^{\mathbf{'}}\) and \(\mathbf{D}\) are positive definite, but \(\mathbf{D}\mathbf{-}\mathbf{D}{\mathbf{\ }\mathbf{Z}}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{-}\mathbf{1}}\mathbf{\ }\mathbf{ZD}\) is not full rank. The reason is complete LD between markers 1 and 3. Therefore, for a given value of \(\mathbf{u}\), there are infinitely many possible combinations of marker effects. Say that \(\widehat{\mathbf{u}}\mathbf{=}\begin{pmatrix} 1 \\ -2 \\ \end{pmatrix}\). Then there are many possible solutions of \(\mathbf{a}\) yielding this \(\widehat{\mathbf{u}}\), for instance \(\begin{pmatrix} 0 & - 2 & 3 \\ \end{pmatrix}\) or \(\begin{pmatrix} -3 & - 2 & 6 \\ \end{pmatrix}\). However, \(\mathbf{a}\) has an a priori structure \(\mathbf{D}\) under which the effects of the first and third SNP have a priori the same size; thus the most likely solution is \(\widehat{\mathbf{a}}\mathbf{=}\mathbf{D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{- 1}\mathbf{\ }\widehat{\mathbf{u}}\mathbf{=}\begin{pmatrix} 1.5 & - 2 & 1.5 \\ \end{pmatrix}\), so their effect is averaged. The conditional distribution of \(\mathbf{a}\) given \(\mathbf{u}\) has a variance

\[Var\left( \mathbf{a} \middle| \mathbf{u} \right)=\mathbf{D - D}\mathbf{Z}^{\mathbf{'}}\left( \mathbf{ZD}\mathbf{Z}^{\mathbf{'}} \right)^{\mathbf{- 1}}\mathbf{\ ZD =}\begin{pmatrix} 0.5 & 0 & - 0.5 \\ 0 & 0 & 0 \\ - 0.5 & 0 & 0.5 \\ \end{pmatrix}\]

which shows well that the first and third markers are in LD (and their estimates cannot be disentangled), whereas the second has a unique solution for a given \(u\). Assume that \(\mathbf{u}\) is estimated with posterior error covariance (or prediction error covariance)

\[Var\left( \mathbf{u} \middle| \mathbf{y} \right)\mathbf{=}\mathbf{C}^{\mathbf{\text{uu}}}\mathbf{=}\begin{pmatrix} 1.5 & 0 \\ 0 & 0.2 \\ \end{pmatrix}\]

Then, the uncertainty in the estimation of marker effects is

\[Var\left( \mathbf{a} \middle| \mathbf{y} \right)\mathbf{=}\begin{pmatrix} 0.925 & - 0.1 & - 0.075 \\ - 0.1 & 0.2 & - 0.1 \\ - 0.075 & - 0.1 & 0.925 \\ \end{pmatrix}\]

The difference between \(Var\left( \mathbf{a} \middle| \mathbf{y} \right)\) and \(Var\left( \mathbf{a} \middle| \mathbf{u} \right)\) is actually \(Var\left( \widehat{\mathbf{a}} \middle| \mathbf{y} \right)\), and has value

\[Var\left( \widehat{\mathbf{a}} \middle| \mathbf{y} \right)\mathbf{=}\begin{pmatrix} 0.425 & - 0.1 & 0.425 \\ - 0.1 & 0.2 & - 0.1 \\ 0.425 & - 0.1 & 0.425 \\ \end{pmatrix}\]

It can be seen that this conditional variance does not account for the LD across markers 1 and 3, or, in other words, it ignores the fact that their sum is the only thing that can be well estimated.

Last, the \(Var\left( \widehat{\mathbf{a}} \right)\) is

\[Var\mathbf{(}\widehat{\mathbf{a}}\mathbf{) =}\begin{pmatrix} 0.075 & 0.1 & 0.075 \\ 0.1 & 0.8 & 0.1 \\ 0.075 & 0.1 & 0.075 \\ \end{pmatrix}\]
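The whole example can be checked with a few lines of numpy; note that \(Var(\widehat{\mathbf{a}})\), being the covariance matrix of an estimator, must be symmetric.

```python
import numpy as np

Z = np.array([[1., 1., 1.],
              [0., 1., 0.]])
D = np.eye(3)                        # prior (co)variance of marker effects (sigma_u^2 = 1)
Cuu = np.diag([1.5, 0.2])            # posterior covariance of u, C^{uu}
G = Z @ D @ Z.T                      # here ZDZ' = G sigma_u^2 with sigma_u^2 = 1
Ginv = np.linalg.inv(G)

u_hat = np.array([1., -2.])
a_hat = D @ Z.T @ Ginv @ u_hat                          # backsolved marker effects

var_a_given_u = D - D @ Z.T @ Ginv @ Z @ D              # Var(a|u)
var_ahat_given_y = D @ Z.T @ Ginv @ Cuu @ Ginv @ Z @ D  # Var(a_hat|y)
var_a_given_y = var_a_given_u + var_ahat_given_y        # Var(a|y)
var_ahat = D @ Z.T @ Ginv @ (G - Cuu) @ Ginv @ Z @ D    # frequentist Var(a_hat)
```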

12.14 GREML and G-Gibbs

Use of genomic relationships to estimate variance components is straightforward, and the popular REML and Gibbs sampling methods have often been used (O. F. Christensen and Lund 2010; Rodríguez-Ramilo, García-Cortés, and González-Recio 2014; Jensen, Su, and Madsen 2012). Also, older estimates using relationships based on markers are common in the conservation genetics literature. Often, people call GBLUP something that is in fact GREML. The difference is that in GREML variance components are estimated, whereas in GBLUP they are fixed a priori.

As discussed, the estimates obtained by GREML or G-Gibbs refer to a base population with the assumed allelic frequencies (usually the observed ones) and in Hardy-Weinberg equilibrium. Therefore, these estimates are not necessarily comparable to pedigree estimates, which refer to another base population. Imagine that, for the same data set, you try three different relationship matrices. Let’s say that you have pedigree, genomic and kernel relationships with respective matrices \(\mathbf{A}_{22}\), \(\mathbf{G}\) and \(\mathbf{K}\) and variance component estimates \(\sigma_{u_{A}}^{2},\sigma_{u_{G}}^{2},\sigma_{u_{K}}^{2}\). They do not refer to the same conceptual base populations. We proposed a method to compare estimates (Andres Legarra 2016). The method basically says that, in order to be comparable, all matrices should have similar statistics (average of the diagonal and of the matrix itself).

Further, data sets are often different, making comparison unreliable. In particular, heritability estimates using so-called “unrelated” populations (Yang et al. 2010) have large standard errors (making comparisons unreliable) and refer to a very particular population, whereas pedigree-based estimates refer to another population.

12.15 Complicated things in GBLUP

12.15.1 Variances of pseudo-data, DYD’s, and de-regressed proofs

Often, pseudo-phenotypes are used. These can consist of results of field trials, progeny performances (P. VanRaden and Wiggans 1991), or own corrected phenotypes. Another type of data is deregressed proofs (Garrick, Taylor, and Fernando 2009; Ricard, Danvy, and Legarra 2013), which consist of post-processing of pedigree-based genetic evaluations. These pseudo-data do not come from a regular phenotype and have varying variances. However, they come with a measure of uncertainty (i.e., a bull can have 10 or 10,000 daughters). This can be accounted for in the residual covariance matrix, \(\mathbf{R}\), which becomes heterogeneous.

In most software (for instance GS3, blupf90 and the R function lm), this is done using weights. Weight \(w_{i}\) means (informally) the “importance” attached to the \(i\)-th record, and (formally) means that the record \(i\) behaves like an average of \(w_{i}\) observations, so that

\(\mathbf{R}\mathbf{=}\begin{pmatrix} 1/w_{1} & 0 & 0 \\ 0 & 1/w_{2} & 0 \\ 0 & 0 & \ldots \\ \end{pmatrix}\sigma_{e}^{2}\)

More weight means reduced residual variance. There are basically two ways to proceed.

Dairy cattle breeders work with “daughter yield deviations” (DYD). These are the average phenotypes of the daughters of every bull, corrected for the EBV of their dams and for environmental effects. Also, an “equivalent daughter contribution” (edc) is computed for the DYD, which reflects the number of daughters of that bull. The pseudo-phenotype for each bull is thus modeled as twice the DYD. If correction were perfect, the 2DYD of bull \(i\) with \(n_{i}\) daughters can be decomposed as:

\[2DYD_{i} = u_{i} + 2\frac{1}{n_{i}}\sum_{j}^{}\phi_{j} + 2\frac{1}{n_{i}}\sum_{j}^{}e_{j} = u_{i} + \frac{1}{n_{i}}\sum_{j}^{}\epsilon_{j}\]

That is, the bull EBV \((u_{i}\)), (twice) the average of its daughters’ Mendelian sampling (\(\phi_{j}\)), and the average of its daughters’ residual deviations (\(e_{j})\). The two latter terms are confounded into a pseudo-residual \(\epsilon\). Then, \(Var\left( \epsilon \right) = 4Var\left( \phi \right) + 4Var\left( e \right) = 2\sigma_{u}^{2} + 4\sigma_{e}^{2}\), because the variance of the Mendelian sampling is half the genetic variance. Finally,

\[Var\left( 2DYD_{i} \right) = \sigma_{u}^{2} + \frac{1}{n_{i}}\sigma_{\epsilon}^{2}\]

Thus, in dairy studies one may use \(2\text{DYD}\) as a trait, with the typical genetic variance of \(\sigma_{u}^{2}\) and a pseudo-residual variance of \(\sigma_{\epsilon}^{2} = 2\sigma_{u}^{2} + 4\sigma_{e}^{2}\) with a weight \(w_{i} = n_{i}\), where \(n_{i}\) is the “equivalent daughter contribution”.
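As a numeric sketch of the weighting above (the variance components and edc values below are purely illustrative):

```python
import numpy as np

sigma2_u, sigma2_e = 0.3, 0.7              # assumed variance components (illustrative)
sigma2_eps = 2 * sigma2_u + 4 * sigma2_e   # pseudo-residual variance: 2*sigma_u^2 + 4*sigma_e^2
edc = np.array([10., 200., 10000.])        # equivalent daughter contributions (weights w_i = n_i)

# variance of each bull's 2*DYD record: sigma_u^2 + sigma_eps^2 / n_i
var_2dyd = sigma2_u + sigma2_eps / edc

# residual covariance matrix used in the evaluation: diag(1/w_i) * sigma_eps^2
R = np.diag(1.0 / edc) * sigma2_eps
```

As the number of daughters grows, the variance of a bull’s 2DYD record approaches the genetic variance \(\sigma_u^2\).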

For other kinds of data, (Garrick, Taylor, and Fernando 2009) proposed a rather general approach for several kinds of pseudo-data. They also provide expressions to set the adequate weights.

12.15.2 Some problems of pseudo-data

Note that the residual covariances of pseudo-data are assumed null. This is wrong: cows in the same herd will share errors in the estimation of the herd effect, which generates a residual covariance; cows born from the same dam will share errors in the estimation of the dam effect, which also generates a residual covariance; and so on. These covariances are ignored. Henderson (C. Henderson 1978) showed, in a similar context, that using precorrected data may lead to considerable bias and loss of accuracy. This is, however, not a problem if the pseudo-records come from progeny testing, in which case the amount of information is so large that covariances among pseudo-data are very small.

13 Non-additive genetic effects in genomic selection

A recent review has been published (Luis Varona et al. 2018), and we refer the reader to it for most of this section.

13.1 Dominant genomic relationships

Under quantitative genetics theory, the additive or breeding value of the \(i\)-th individual \((u_{i})\) involves the substitution effects of the genes \((\alpha)\)

\[\alpha = a + d(q - p)\]

which includes the “biological” additive effect \(a\), the “biological” dominant effect \(d\) of the genes and the allele frequencies. So, the breeding values of a set of individuals are \(\mathbf{u}\mathbf{=}\mathbf{Z}\alpha\). With no dominant effect of the gene \((d = 0)\), \(\alpha = a\) and \(\mathbf{u}=\mathbf{Za}\) as was defined in the previous sections.

If we consider one locus with two alleles \((A_{1}\ \text{and}\ A_{2})\), a biological effect for each genotype can be defined, \(A_{1}A_{1} = \ a\), \(A_{1}A_{2} = \ d\) and \(A_{2}A_{2} = - a\), for instance as deviations from the midpoint of the two homozygotes as in (Falconer and Mackay 1996). Naturally, a model that fits additive and dominant genotypic effects of the gene (or marker) can be written as

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{Ta} + \mathbf{Xd + e}\]

where “biological” additive effects \(\mathbf{a}\) and “biological” dominant effects \(\mathbf{d}\) for a set of individuals are included for each of the \(n\) markers (M. A. Toro and Varona 2010). It will be discussed in more detail later in these notes.

This “intuitive” and useful model fits “biological” effects of genes or markers, while traditional quantitative genetics talks about “statistical” effects (William G. Hill, Goddard, and Visscher 2008). Breeding values, dominance deviations, epistatic deviations and their variance components are statistical outcomes defined in a population context.

Thus, a genomic dominant model directly comparable to the classical genetic model (e.g. pedigree-based BLUP) has to involve breeding values \(\mathbf{u}\) and dominance deviation \(\mathbf{v}\) as

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{u} + \mathbf{v + e}\]

As in (Falconer and Mackay 1996) (Table 7.3), the breeding value for an individual is \(u_{A_{1}A_{1}} = 2q\alpha = \left( 2 - 2p \right)\alpha\), \(u_{A_{1}A_{2}} = \left( q - p \right)\alpha = \left( 1 - 2p \right)\alpha\) or \(u_{A_{2}A_{2}} = \left( - 2p \right)\alpha\), depending on its genotype and \(p\) is the frequency of \(A_{1}\). So, the breeding values of a set of individuals are \(\mathbf{u} = \mathbf{Z}\boldsymbol{\alpha}\) (with \(\mathbf{Z}\) coded as in (P. M. VanRaden 2008)). The element of \(\mathbf{Z}\) for an individual \(i\) at the marker \(j\) is

\[Z_{\text{ij}} = \left\{ \begin{matrix} (2 - 2p_{j}) \\ (1 - 2p_{j}) \\ - 2p_{j} \\ \end{matrix} \right.\ \mathrm{\text{for\ genotypes}}\left\{ \begin{matrix} A_{1}A_{1} \\ A_{1}A_{2} \\ A_{2}A_{2} \\ \end{matrix} \right.\]

Also, the dominant deviation of an individual is \(v_{A_{1}A_{1}} = - 2q^{2}d\), \(v_{A_{1}A_{2}} = 2\text{pqd}\) and \(v_{A_{2}A_{2}} = - 2p^{2}d\). Hence, for a set of individuals, the dominance deviations are \(\mathbf{v} = \mathbf{Wd}\) with the element of \(\mathbf{W}\) for an individual \(i\) at the marker \(j\) equal to

\[W_{\text{ij}} = \left\{ \begin{matrix} - 2q_{j}^{2} \\ 2p_{j}q_{j} \\ - 2p_{j}^{2} \\ \end{matrix}\mathrm{\text{\ \ for\ genotypes}}\left\{ \begin{matrix} A_{1}A_{1} \\ A_{1}A_{2} \\ A_{2}A_{2} \\ \end{matrix} \right.\ \right.\]

Note that breeding value \((u)\) involves both “biological” additive and dominant effects of the markers \((a\ \text{and}\ d)\); dominance deviation \((v)\) only includes a portion of the biological dominant effects of the markers \((d)\).
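The codings of \(\mathbf{Z}\) and \(\mathbf{W}\) above can be written as a small helper; this is a sketch in numpy where the function name and the 0/1/2 genotype coding (copies of \(A_1\)) are our own conventions.

```python
import numpy as np

def make_Z_W(M, p):
    """Build incidence matrices for breeding values (Z, VanRaden 2008 coding)
    and dominance deviations (W). M holds genotypes coded as copies of A1
    (2 = A1A1, 1 = A1A2, 0 = A2A2); p is the frequency of A1 per marker."""
    q = 1.0 - p
    Z = M - 2.0 * p                                    # 2-2p, 1-2p, -2p
    W = np.where(M == 2, -2.0 * q**2,
        np.where(M == 1, 2.0 * p * q, -2.0 * p**2))    # -2q^2, 2pq, -2p^2
    return Z, W

# one individual, three markers, all with p = 0.5
Z, W = make_Z_W(np.array([[2., 1., 0.]]), np.array([0.5, 0.5, 0.5]))
```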

From this information (also in Table 7.3 in Falconer and Mackay, 1996), the variance of breeding values and the variance of dominance deviations are obtained. The additive genetic variance is \(\sigma_{u}^{2} = 2\text{pq}\left\lbrack a + d(q - p) \right\rbrack^{2} = 2\text{pq}\alpha^{2}\) with \(E\left( u \right) = 0\). Additive variance includes variation due to the additive and dominant effects of the markers.

Also, like breeding values, the mean of dominance deviation is \(E\left( v \right) = 0\),

\[E\left( v \right) = p^{2}\left( - 2q^{2}d \right) + 2pq\left( 2pqd \right) + q^{2}\left( - 2p^{2}d \right) = 0\]

and the dominance genetic variance is equal to \(\sigma_{v}^{2} = {E\left( v^{2} \right) - \left\lbrack E(v) \right\rbrack}^{2} = E\left( v^{2} \right)\), so

\[\sigma_{v}^{2} = p^{2}\left( - 2q^{2}d \right)^{2} + 2pq\left( 2pqd \right)^{2} + q^{2}\left( - 2p^{2}d \right)^{2} = 4p^{2}q^{2}d^{2}(q^{2} + 2pq + p^{2})\]

\[\sigma_{v}^{2} = \left\lbrack 2pqd \right\rbrack^{2}\]

Dominance deviation variance only includes a portion of the biological dominant effects of the markers.

Extended to several markers, and considering marker effects as random, this gives

\[\sigma_{u}^{2} = \sum_{j = 1}^{\text{nsnp}}{\left( 2p_{j}q_{j} \right)\sigma_{a0}^{2}} + \sum_{j = 1}^{\text{nsnp}}\left( 2p_{j}q_{j}\left( q_{j} - p_{j} \right)^{2} \right)\sigma_{d0}^{2}\]

\[\sigma_{v}^{2} = \sum_{j = 1}^{\text{nsnp}}{\left( 2p_{j}q_{j} \right)^{2}\sigma_{d0}^{2}}\]

where \(\sigma_{a0}^{2}\) and \(\sigma_{d0}^{2}\) are the SNP variances for additive and dominant components, respectively.

The total genetic variance is \(\sigma_{g}^{2} = \sigma_{u}^{2} + \sigma_{v}^{2}\); the first term is the additive genetic variance and the second term corresponds to the dominance genetic variance or dominance deviation variance. Note that the “statistical” partition of the variance into components due to additivity, dominance and epistasis does not reflect the “biological” effects of the genes (Huang and Mackay 2016), though it is useful for prediction and selection decisions. Even when the genes have a biological or functional dominant action, this variation is mostly captured by the additive genetic variance (W. G. Hill and Mäki-Tanila 2015).

The “statistical” or classical parameterization implies linkage equilibrium and a population in Hardy-Weinberg equilibrium. Assuming uncorrelated random marker effects \((a,\ d)\), it can be extended to multiple loci (P. M. VanRaden 2008; Daniel Gianola et al. 2009) to obtain

\[Var\left( \mathbf{u} \right) = \frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{2\sum_{j = 1}^{\text{nsnp}}{p_{j}q_{j}}}\sigma_{u}^{2} = \mathbf{G}\sigma_{u}^{2}\]

as in (Zulma G. Vitezica, Varona, and Legarra 2013) which is the classical additive genomic relationship matrix G-matrix of GBLUP (P. M. VanRaden 2008). Note that the variance component is \(\sigma_{u}^{2} = \sum\left( 2p_{j}q_{j} \right)\sigma_{a0}^{2} + \sum 2p_{j}q_{j}\left( q_{j} - p_{j} \right)^{2}\sigma_{d0}^{2}\).

For the dominant deviations \(\mathbf{v}\), its variance-covariance matrix is:

\[Var\left( \mathbf{v} \right) = \mathbf{WW'}\sigma_{d0}^{2}\]

After dividing by the variance of the dominance deviations which is

\[\sigma_{v}^{2} = \sum_{j = 1}^{\text{nsnp}}{\left( 2p_{j}q_{j} \right)^{2}\sigma_{d0}^{2}}\]

the dominant genomic relationship matrix, D, is obtained as

\[Var\left( \mathbf{v} \right) = \frac{\mathbf{W}\mathbf{W}^{\mathbf{'}}}{\sum_{j = 1\ }^{\text{nsnp}}\left( 2p_{j}q_{j} \right)^{2}}\sigma_{v}^{2} = \mathbf{D}\sigma_{v}^{2}\]

The dominant genomic matrix \(\mathbf{D}\) has features analogous to those presented for \(\mathbf{G}\) in a previous section. Remember that in a base population in Hardy-Weinberg equilibrium, the average of the diagonal of \(\mathbf{G}\) is one, whereas the average off-diagonal is 0. In the same conditions (base population in Hardy-Weinberg equilibrium), it turns out that the expected diagonal element of \(\mathbf{D}\) is

\[\frac{\left\lbrack p^{2}\left( - 2q^{2} \right)^{2} + 2\text{pq}\left( 2\text{pq} \right)^{2} + q^{2}\left( - 2p^{2} \right)^{2} \right\rbrack\ }{\left( 2\text{pq} \right)^{2}}\]

for one locus, which is equal to 1. In addition, the expected off-diagonal element of \(\mathbf{D}\), which can be written as

\[\frac{\begin{pmatrix} p^{2} & 2\text{pq} & q^{2} \\ \end{pmatrix}\begin{pmatrix} - 2q^{2} \\ 2\text{pq} \\ - 2p^{2} \\ \end{pmatrix}\begin{pmatrix} - 2q^{2} \\ 2\text{pq} \\ - 2p^{2} \\ \end{pmatrix}^{\mathbf{'}}\begin{pmatrix} p^{2} & 2\text{pq} & q^{2} \\ \end{pmatrix}^{\mathbf{'}}}{\left( 2\text{pq} \right)^{2}}\]

for one locus, is equal to \(0\). Both features correspond to proper definitions of dominant relationships in a base population.
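Both features can be checked by simulation. The sketch below (with illustrative sizes and frequencies) samples genotypes in Hardy-Weinberg proportions and verifies that the average diagonals of \(\mathbf{G}\) and \(\mathbf{D}\) are close to 1 and the average off-diagonal of \(\mathbf{G}\) is close to 0.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 500, 2000                                   # animals, markers (toy sizes)
p = rng.uniform(0.05, 0.95, m)                     # base allele frequencies
q = 1 - p
# genotypes sampled in Hardy-Weinberg proportions, coded 0/1/2
M = rng.binomial(2, p, size=(n, m)).astype(float)

Z = M - 2 * p                                      # breeding-value coding
W = np.where(M == 2, -2 * q**2,                    # dominance-deviation coding
    np.where(M == 1, 2 * p * q, -2 * p**2))

G = Z @ Z.T / (2 * np.sum(p * q))                  # additive genomic matrix
Dmat = W @ W.T / np.sum((2 * p * q) ** 2)          # dominant genomic matrix
```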

13.2 Animal model GDBLUP

With dominant genomic relationships defined in the previous section as \(Var\left( \mathbf{v} \right) = \mathbf{D}\sigma_{v}^{2}\), the use of this matrix in a mixed model context for genomic predictions in GBLUP form is straightforward. We have the following linear mixed model:

\[\mathbf{y = Xb + Hu + Hv + e}\]

where \(\mathbf{H}\) is an incidence matrix (here it is not the matrix of SSGBLUP) linking individuals to records (phenotypes). With \(Var\left( \mathbf{u} \right) = \mathbf{G}\sigma_{u}^{2}\), \(Var\left( \mathbf{e} \right) = \mathbf{R}\) and assuming multivariate normality, Henderson’s mixed model equations are:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H} \\ \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{X} & \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H +}\mathbf{G}^{- 1}\sigma_{u}^{- 2}\ & \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H} \\ \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{X} & \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H} & \mathbf{H}^{\mathbf{'}}\mathbf{R}^{\mathbf{-}1}\mathbf{H +}\mathbf{D}^{- 1}\sigma_{v}^{- 2}\ \\ \end{pmatrix}\ \begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \widehat{\mathbf{v}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{H}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{H}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

These equations are identical to those of the regular animal model, with the exception that genomic relationships in \(\mathbf{G}\) and \(\mathbf{D}\) are used instead of pedigree relationships. The breeding values and the dominance deviations can be predicted from these equations in a population in Hardy-Weinberg and linkage equilibrium.

Note that it is not possible to use progeny records (DYD’s) to predict dominance, because dominance deviations average to 0 across the progeny.
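The mixed model equations above can be set up and solved directly for a toy data set; a numpy sketch, where the relationship matrices, variance components and data are purely illustrative and \(\mathbf{H}\) is the identity (one record per animal):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                   # animals, one record each
X = np.ones((n, 1))                     # overall mean as the only fixed effect
H = np.eye(n)                           # record-to-animal incidence (not the ssGBLUP H)
y = rng.normal(size=n)

# stand-in relationship matrices and variance components (illustrative values)
G = np.full((n, n), 0.5) + 0.5 * np.eye(n)
Dmat = np.full((n, n), 0.2) + 0.8 * np.eye(n)
s2u, s2v, s2e = 0.4, 0.1, 0.5

Rinv = np.eye(n) / s2e
Ginv = np.linalg.inv(G) / s2u
Dinv = np.linalg.inv(Dmat) / s2v

XtR, HtR = X.T @ Rinv, H.T @ Rinv
LHS = np.block([[XtR @ X, XtR @ H,         XtR @ H],
                [HtR @ X, HtR @ H + Ginv,  HtR @ H],
                [HtR @ X, HtR @ H,         HtR @ H + Dinv]])
RHS = np.concatenate([XtR @ y, HtR @ y, HtR @ y])
sol = np.linalg.solve(LHS, RHS)
b_hat, u_hat, v_hat = sol[:1], sol[1:1 + n], sol[1 + n:]
```

The solutions coincide with the equivalent GLS/BLUP expressions based on \(Var(\mathbf{y}) = \mathbf{G}\sigma_u^2 + \mathbf{D}\sigma_v^2 + \mathbf{R}\).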

With the exception of (Aliloo et al. 2016) (for fat yield in Holstein), in most studies the inclusion of dominance in the GBLUP model did not improve the predictive ability of the model (Su, Christensen, et al. 2012; Ertl et al. 2014; Xiang et al. 2016; Esfandyari et al. 2016; Moghaddar and Werf 2017), whereas inclusion of the effect of inbreeding depression (shown later) does (Xiang et al. 2016).

13.3 Another parameterization

Now, we come back to the “intuitive” model fitting “biological” effects of markers,

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{Ta} + \mathbf{Xd + e}\]

An additive effect \(a_{j}\) and a dominant effect \(d_{j}\) are included for each of the markers. The covariate \(t_{\text{ij}}\) is equal to 1, 0, -1, for SNP genotypes \(A_{1}A_{1}\), \(A_{1}A_{2}\) and \(A_{2}A_{2}\), respectively. For the dominant component, \(x_{\text{ij}}\) is equal to 0, 1, 0 for SNP genotypes \(A_{1}A_{1}\), \(A_{1}A_{2}\) and \(A_{2}A_{2}\), respectively. This model is based on “observed” genotypes and in particular in heterozygotes, so it can be called a “genotypic” model.

From this model, proposed by (Su, Christensen, et al. 2012), we can define \(\mathbf{u}^{\mathbf{*}}\) and \(\mathbf{v}^{\mathbf{*}}\) as the “genotypic” additive and dominant effects. So, we can write for a set of individuals \(\mathbf{u}^{\mathbf{*}}\mathbf{=}\mathbf{Ta}\) and \(\mathbf{v}^{\mathbf{*}}\mathbf{=}\mathbf{Xd}\).

Genotypic parameterization:

| Genotype | Frequency | Additive value | Dominant value | \(u^*\) | \(v^*\) |
|----------|-----------|----------------|----------------|---------|---------|
| \(A_1 A_1\) | \(p^2\) | \(a\) | \(0\) | \((2-2p)a\) | \(-2pqd\) |
| \(A_1 A_2\) | \(2pq\) | \(0\) | \(d\) | \((1-2p)a\) | \((1-2pq)d\) |
| \(A_2 A_2\) | \(q^2\) | \(-a\) | \(0\) | \(-2pa\) | \(-2pqd\) |
| Average |  | \((p-q)a\) | \(2pqd\) |  |  |

Note that \(\mathbf{u}^{\mathbf{*}}\) is not a breeding value because \(\mathbf{a}\) is NOT a substitution effect; it is the part attributable to the additive “biological” effect of the marker. The incidence matrix \(\mathbf{T}\) corresponds to the incidence matrix \(\mathbf{Z}\) (used in the classical model defined in terms of breeding values and dominance deviations). However, the matrix \(\mathbf{X} \neq \mathbf{W}\) (\(\mathbf{W}\) is used in the classical model for the dominance deviations).
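The correspondence between the codings can be seen in a one-locus numpy sketch (illustrative \(p=0.3\)): \(\mathbf{T}\) equals \(\mathbf{Z}\) up to a constant shift of \(1-2p\), while \(\mathbf{X}\) and \(\mathbf{W}\) differ even after centering.

```python
import numpy as np

p = 0.3                          # illustrative allele frequency of A1
q = 1 - p
M = np.array([2., 1., 0.])       # genotypes A1A1, A1A2, A2A2 as copies of A1

T = np.where(M == 2, 1., np.where(M == 1, 0., -1.))    # "genotypic" additive coding
X = np.where(M == 1, 1., 0.)                           # heterozygote indicator
Z = M - 2 * p                                          # breeding-value coding
W = np.where(M == 2, -2 * q**2,                        # dominance-deviation coding
    np.where(M == 1, 2 * p * q, -2 * p**2))
```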

The variance of the genotypic additive value can be obtained as \(\sigma_{u^{*}}^{2} = E\left( {u^{*}}^{2} \right) - \left\lbrack E(u^{*}) \right\rbrack^{2}\), and idem for the variance of the genotypic dominant value \(\sigma_{v^{*}}^{2}\). Then

\[\sigma_{u^{*}}^{2} = \sum_{}^{}{2p_{j}q_{j}}\sigma_{a}^{2}\]

and

\[\sigma_{v^{*}}^{2} = \sum_{}^{}{2p_{j}q_{j}(1 - 2p_{j}q_{j})}\sigma_{d}^{2}\]

Quite different from

\[\sigma_{u}^{2} = \sum_{j = 1}^{\text{nsnp}}{\left( 2p_{j}q_{j} \right)\sigma_{a0}^{2}} + \sum_{j = 1}^{\text{nsnp}}\left( 2p_{j}q_{j}\left( q_{j} - p_{j} \right)^{2} \right)\sigma_{d0}^{2}\]

\[\sigma_{v}^{2} = \sum_{j = 1}^{\text{nsnp}}{\left( 2p_{j}q_{j} \right)^{2}\sigma_{d0}^{2}}\]

that we obtained before. The variances \(\sigma_{u^{*}}^{2}\) and \(\sigma_{v^{*}}^{2}\) estimated under the “genotypic” model as in Su et al. (2012) are NOT genetic variances. In particular, \(\sigma_{u^{*}}^{2}\) does not include the contribution of dominant effects, whereas by definition the breeding value of an individual contains substitution effects, which include dominant effects. Therefore, \(\sigma_{u}^{2}\) and \(\sigma_{v}^{2}\) are more useful for selection.

Vitezica et al. (2013) showed that the dominance relationship matrices (\(\mathbf{D}\)) also differ between the classical (statistical) and the “genotypic” model. The parameterization is largely a matter of convenience; both models are able to explain the data \((\mathbf{y})\), but their interpretations differ. The classical model, in terms of breeding values and substitution effects (statistical), is more adequate for selection (both for ranking animals and for predicting genetic improvement).

The only variance components comparable with pedigree-based estimates are \(\sigma_{u}^{2}\) and \(\sigma_{v}^{2}\) obtained from the statistical genomic model. Using variance components estimated from the “genotypic” model is misleading because they underestimate the importance of additive variance and overestimate the importance of dominance variance (Zulma G. Vitezica, Varona, and Legarra 2013). From the total genetic variance \(\sigma_{g}^{2}\) it can be verified that \(\sigma_{u}^{2} + \sigma_{v}^{2} = \sigma_{u^{*}}^{2} + \sigma_{v^{*}}^{2}\). Thus, it is simple to switch variance component estimates between the “statistical” \((\sigma_{u}^{2}\ \text{and}\ \sigma_{v}^{2})\) and the “biological” \((\sigma_{u^{*}}^{2}\ \text{and}\ \sigma_{v^{*}}^{2})\) models if the distribution of the allelic frequencies is available (Vitezica et al., 2013).
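The equality \(\sigma_{u}^{2} + \sigma_{v}^{2} = \sigma_{u^{*}}^{2} + \sigma_{v^{*}}^{2}\) can be checked numerically from the per-locus formulas above; the allele frequencies and the variances \(\sigma_{a0}^{2}\) and \(\sigma_{d0}^{2}\) below are arbitrary toy values:

```python
import numpy as np

p = np.array([0.1, 0.3, 0.5, 0.7])   # toy allele frequencies for 4 loci
q = 1 - p
var_a0, var_d0 = 1.0, 0.5            # assumed "biological" variances of a and d

# statistical decomposition (breeding values / dominance deviations)
var_u = np.sum(2*p*q) * var_a0 + np.sum(2*p*q*(q - p)**2) * var_d0
var_v = np.sum((2*p*q)**2) * var_d0

# genotypic decomposition
var_u_star = np.sum(2*p*q) * var_a0
var_v_star = np.sum(2*p*q*(1 - 2*p*q)) * var_d0

# totals agree, but the split between "additive" and "dominant" differs
assert np.isclose(var_u + var_v, var_u_star + var_v_star)
```

The check works because, per locus, \((q-p)^{2} + 2pq = 1 - 2pq\), so the dominance variance moved into \(\sigma_{u}^{2}\) is exactly compensated.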

The “statistical” model of Vitezica et al. (2013) is orthogonal. This means that introducing a new genetic effect (e.g. additive vs. additive plus dominance) in the model does not change previous estimates. For instance, going from an additive to an additive + dominant model should change neither the estimates of variance components nor the estimates of breeding values and dominance deviations much. However, the “genotypic” model of Su et al. (2012) is not orthogonal. Including dominance may greatly change the estimates of additive values and variances and, in addition, the estimated additive values are not breeding values – they are “genotypic” additive values.

13.4 Inbreeding depression

The phenomena of inbreeding depression and heterosis may be explained by directional dominance (Lynch and Walsh 1998). In other words, a higher percentage of positive than negative functional dominant effects \(d\) is expected in reality.

With directional dominance, the mean of the dominant effects \(\mathbf{d}\) is different from zero. However, models typically assume that \(\mathbf{a}\) and \(\mathbf{d}\) have zero means. Xiang et al. (2016) showed that inclusion of genomic inbreeding (based on SNPs and included as a covariate) accounts for directional dominance and inbreeding depression.

Xiang et al. (2016) proposed to write the model including (biological) additive and dominant effects of the markers as

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{Ta} + \mathbf{X}\mathbf{d}^{\mathbf{*}}\mathbf{+ X}\mathbf{1}\mu_{d}\mathbf{+ e}\]

where \(\mathbf{d}^{\mathbf{*}}\mathbf{=}\mathbf{d}\mathbf{-}E\left( \mathbf{d} \right) = \mathbf{d} - \mu_{d}\) and the matrix \(\mathbf{X}\) has a value of 1 at heterozygous loci for an individual and 0 otherwise.

The term \(\mathbf{X}\mathbf{1}\), defined as \(\mathbf{h}\mathbf{=}\mathbf{X}\mathbf{1}\), contains the row sums of \(\mathbf{X}\), i.e. individual heterozygosities (how many markers are heterozygous for each individual). The genomic inbreeding coefficient \(\mathbf{f}\) can be calculated as \(\mathbf{f}\mathbf{=}\mathbf{1}\mathbf{-}\mathbf{h}\mathbf{/}N\), where \(N\) is the number of markers. That is, \(\mathbf{f}\) is a vector that contains the proportion of homozygous loci for each individual. Then,

\[\mathbf{h} = \left( \mathbf{1 - f} \right)N = \mathbf{1}N + \mathbf{f}\left( - N \right)\]

and with the mean \(\mu_{d}\)

\[\mathbf{h}\mu_{d} = \left( \mathbf{1 - f} \right)N\mu_{d} = \mathbf{1}N\mu_{d} + \mathbf{f}\left( - N\mu_{d} \right)\]

Thus, the model can be rewritten as

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{Ta} + \mathbf{X}\mathbf{d}^{\mathbf{*}}\mathbf{+ 1}N\mu_{d}\mathbf{+ f( -}N\mu_{d}\mathbf{) + e}\]

and finally

\[\mathbf{y} = \mathbf{1}\mu^{*} + \mathbf{Ta} + \mathbf{X}\mathbf{d}^{\mathbf{*}}\mathbf{+ f}b\mathbf{+ e}\]

where the term \(\mathbf{1}N\mu_{d}\) is confounded with the overall mean of the model \((\mu^{*})\), while \(\mathbf{f}\left( - N\mu_{d} \right)\) models the inbreeding depression, and \(b = - N\mu_{d}\) is the inbreeding depression summed over the marker loci, which is to be estimated.

This important result means that genomic inbreeding can be used to model directional dominance. This model allows obtaining estimates of the inbreeding depression parameter in different populations (e.g. breeds or lines) and also in crossbred animals (Xiang et al. 2016).
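A minimal sketch of this result with simulated data: compute \(\mathbf{f} = \mathbf{1} - \mathbf{h}/N\) from toy genotypes, simulate phenotypes with a hypothetical depression parameter \(b\), and recover \(b\) by regression on \(\mathbf{f}\) (all values are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 500
M = rng.integers(0, 3, size=(n, m))    # toy genotypes coded as counts of A1

X = (M == 1).astype(float)             # heterozygosity indicators
h = X.sum(axis=1)                      # h = X1: heterozygous loci per individual
f = 1.0 - h / m                        # genomic inbreeding coefficients

# simulate phenotypes with a known (hypothetical) inbreeding depression b
b_true = -5.0
y = 10.0 + f * b_true + rng.normal(0.0, 0.1, size=n)

# estimate mu* and b by regressing y on [1, f]
Q = np.column_stack([np.ones(n), f])
mu_hat, b_hat = np.linalg.lstsq(Q, y, rcond=None)[0]
```

The estimate `b_hat` should be close to the simulated depression; in a real analysis \(\mathbf{f}\) would enter the full mixed model as a covariate alongside \(\mathbf{u}\) and \(\mathbf{v}\).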

Inclusion of genomic inbreeding must always be done in order to obtain a correct estimate of the genetic dominance variance \((\sigma_{v}^{2})\); otherwise, the genetic dominance variance is inflated. This was confirmed in real data (Xiang et al. 2016; Aliloo et al. 2016). This has long been known for pedigree analysis (e.g. De Boer and Hoeschele 1993); even if dominance is not considered, inbreeding may be considered in genomic evaluations.

13.5 Genomic relationship matrices in absence of HWE

In the classical or “statistical” model that we showed previously, the effects (additive effects or breeding values, dominance deviations, and epistatic deviations) are all orthogonal under linkage equilibrium and Hardy-Weinberg equilibrium (HWE). What does the orthogonality of the model mean? It means that the estimation of one genetic effect (e.g. additive) is not affected by the presence or absence of other genetic effects in the model (e.g. dominance or epistasis).

This property results in an orthogonal partition of the variances. Why? Because the substitution effects contribute to the additive genetic variance, the dominance deviations contribute to the dominance genetic variance, and so on. There is no covariance between the genetic effects. In other words, introducing a new genetic effect (e.g. additive vs. additive plus dominance) in the model does not change previous estimates. For instance, going from an additive to an additive + dominant model should change neither the estimates of variance components nor the estimates of breeding values and dominance deviations much.

Crossbreeding schemes are widely used in animal breeding (e.g. pigs, chickens) for the purpose of exploiting the heterosis and breed complementarity that often occur in crosses. These crosses (e.g. F1) or inbred populations are not in Hardy-Weinberg equilibrium. Therefore, we need methods for genomic prediction, preferably including dominance, in these populations.

Additive \((\mathbf{G})\) and dominance deviation \((\mathbf{D})\) relationship matrices can be built removing the requirement of Hardy-Weinberg equilibrium and assuming linkage equilibrium (Vitezica et al., 2017). This generalization of the classical model is based on the NOIA orthogonal approach (Álvarez-Castro and Carlborg 2007).

So, the breeding values of a set of individuals are \(\mathbf{u} = \mathbf{Z}\alpha\), where \(\mathbf{\alpha}\) are the substitution effects, and the element of \(\mathbf{Z}\) for an individual \(i\) at marker \(j\) is

\[\mathbf{z}_{ij} = \left\{ \begin{matrix} \ \ - ( - p_{A_{1}A_{2}} - 2p_{A_{2}A_{2}}) \\ - (1 - p_{A_{1}A_{2}} - 2p_{A_{2}A_{2}}) \\ - (2 - p_{A_{1}A_{2}} - 2p_{A_{2}A_{2}}) \\ \end{matrix} \right.\ \mathrm{\text{\ \ \ \ for\ genotypes\ }}\left\{ \begin{matrix} A_{1}A_{1} \\ A_{1}A_{2} \\ A_{2}A_{2} \\ \end{matrix} \right.\]

and the dominance deviations are \(\mathbf{v} = \mathbf{\text{Wd}}\), where the element of \(\mathbf{W}\) for an individual \(i\) at marker \(j\) is

\[\mathbf{w}_{\text{ij}} = \left\{ \begin{matrix} - \frac{2p_{A_{1}A_{2}}p_{A_{2}A_{2}}}{p_{A_{1}A_{1}} + p_{A_{2}A_{2}} - \left( p_{A_{1}A_{1}} - p_{A_{2}A_{2}} \right)^{2}} \\ \frac{4p_{A_{1}A_{1}}p_{A_{2}A_{2}}}{p_{A_{1}A_{1}} + p_{A_{2}A_{2}} - \left( p_{A_{1}A_{1}} - p_{A_{2}A_{2}} \right)^{2}} \\ - \frac{2p_{A_{1}A_{1}}p_{A_{1}A_{2}}}{p_{A_{1}A_{1}} + p_{A_{2}A_{2}} - \left( p_{A_{1}A_{1}} - p_{A_{2}A_{2}} \right)^{2}} \\ \end{matrix}\mathrm{\text{\ \ \ \ \ for\ genotypes\ \ }}\left\{ \begin{matrix} A_{1}A_{1} \\ A_{1}A_{2} \\ A_{2}A_{2} \\ \end{matrix} \right.\ \right.\]

where \(p_{A_{1}A_{1}},\ p_{A_{1}A_{2}}\) and \(p_{A_{2}A_{2}}\) are the genotypic frequencies for the genotypes \(A_{1}A_{1}\), \(A_{1}A_{2}\) and \(A_{2}A_{2}\). Under the assumption of HWE, the “statistical” model presented before (as in Vitezica et al., 2013) is a particular case of this model where \(p_{A_{1}A_{1}} = p^{2},\ \ p_{A_{1}A_{2}} = 2\text{pq}\) and \(p_{A_{2}A_{2}} = q^{2}\), and the denominator \(p_{A_{1}A_{1}} + p_{A_{2}A_{2}} - \left( p_{A_{1}A_{1}} - p_{A_{2}A_{2}} \right)^{2} = 2\text{pq}\).

The additive relationship matrix is obtained as:

\[Var\left( \mathbf{u} \right) = \frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{tr(\mathbf{Z}\mathbf{Z}^{\mathbf{'}})/n}\sigma_{u}^{2} = \mathbf{G}\sigma_{u}^{2}\]

where \(\text{tr}\) is the trace of the matrix and \(n\) is the number of individuals. In a Hardy-Weinberg population, \(\text{tr}(\mathbf{Z}\mathbf{Z}^{\mathbf{'}})/n\) corresponds to the heterozygosity of the markers, \(2\sum\text{pq}\).

For the dominance deviations, the relationship matrix is

\[Var\left( \mathbf{v} \right) = \frac{\mathbf{W}\mathbf{W}^{\mathbf{'}}}{tr(\mathbf{W}\mathbf{W}^{\mathbf{'}})/n\ }\sigma_{v}^{2} = \mathbf{D}\sigma_{v}^{2}\]

In Hardy-Weinberg equilibrium, \(\text{tr}(\mathbf{W}\mathbf{W}^{\mathbf{'}})/n\) corresponds to the sum of squared heterozygosities of the markers, \(4\sum\left( \text{pq} \right)^{2}\) (Zulma G. Vitezica, Varona, and Legarra 2013).
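The NOIA matrices \(\mathbf{Z}\) and \(\mathbf{W}\) and the trace-normalized \(\mathbf{G}\) and \(\mathbf{D}\) can be sketched as follows; the genotypes are toy data, the helper name `noia_matrices` is ours, and all loci are assumed polymorphic:

```python
import numpy as np

def noia_matrices(M):
    """Sketch of NOIA additive (Z) and dominance (W) incidence matrices from
    genotypes M coded 2/1/0 = A1A1/A1A2/A2A2, using observed genotype
    frequencies (following the coefficients shown in the text)."""
    n, m = M.shape
    Z = np.zeros((n, m))
    W = np.zeros((n, m))
    for j in range(m):
        g = M[:, j]
        pAA = np.mean(g == 2); pAa = np.mean(g == 1); paa = np.mean(g == 0)
        # additive coefficients for A1A1, A1A2, A2A2
        za = -(-pAa - 2*paa); zh = -(1 - pAa - 2*paa); zb = -(2 - pAa - 2*paa)
        Z[:, j] = np.where(g == 2, za, np.where(g == 1, zh, zb))
        # dominance coefficients (denominator assumed nonzero: polymorphic locus)
        denom = pAA + paa - (pAA - paa)**2
        wa = -2*pAa*paa/denom; wh = 4*pAA*paa/denom; wb = -2*pAA*pAa/denom
        W[:, j] = np.where(g == 2, wa, np.where(g == 1, wh, wb))
    return Z, W

rng = np.random.default_rng(3)
M = rng.integers(0, 3, size=(100, 50))      # toy genotypes
Z, W = noia_matrices(M)
n = M.shape[0]
G = Z @ Z.T / (np.trace(Z @ Z.T) / n)       # additive relationship matrix
D = W @ W.T / (np.trace(W @ W.T) / n)       # dominance relationship matrix
```

By construction, the columns of \(\mathbf{Z}\) and \(\mathbf{W}\) are centered and mutually orthogonal, and \(\text{tr}(\mathbf{G})/n = \text{tr}(\mathbf{D})/n = 1\).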

Now we know how to build a model that allows the orthogonal decomposition of the variances in any population, in Hardy-Weinberg equilibrium or not, and thus the correct estimation of genetic variance components (equivalent to pedigree-based estimates).

13.6 Epistatic genomic relationships

The traditional definition of epistasis is the interaction between genes, in pairwise or higher-order combinations. The number of epistatic effects, and accordingly the number of parameters in the model, may be extremely large. Thus, we can define epistatic relationship matrices for individuals, as we do in GBLUP, which is more efficient from the computational point of view.

Following this idea, Cockerham (1954) suggested using the Hadamard product of the additive and dominance (pedigree-based) relationship matrices to obtain the epistatic relationship matrices. Remember that the Hadamard product \(( \odot )\) of two matrices \(\mathbf{B}\) and \(\mathbf{C}\) produces another matrix \(\mathbf{A} = \mathbf{B} \odot \mathbf{C}\) of the same dimension, where each element of \(\mathbf{A}\) is the product of the corresponding elements \((a_{\text{ij}} = b_{\text{ij}}c_{\text{ij}})\) of the two original matrices.

The construction of the epistatic relationship matrices using the Hadamard product depends on the assumption of Hardy-Weinberg equilibrium, in other words, no inbreeding and random mating (C. Clark Cockerham 1954). The Hadamard product relies on the orthogonality of the model, because no covariance exists between main genetic effects (e.g. additive and epistatic effects). Henderson (C. R. Henderson 1985) suggested the use of these matrices in BLUP.

Henderson’s approach was extended to the genomic framework by Xu (2013) for an F2 design and used for predicting hybrid performance in a rice F2 population (Xu, Zhu, and Zhang 2014). However, their extension is not general. It is a particular case because genotype frequencies in the F2 are the Hardy Weinberg frequencies corresponding to the allele frequency in the F1 (Falconer and Mackay 1996).

If we assume a more general situation with or without Hardy Weinberg equilibrium, we need to check if Hadamard product of genomic matrices is equivalent to the direct estimation of loci-based epistatic effects.

We have the following linear model with epistasis:

\[\mathbf{y} = \mathbf{1}\mu + \mathbf{u} + \mathbf{v} + \sum_{i = A,D}^{}{\sum_{j = A,D}^{}\mathbf{g}_{\mathbf{\text{ij}}}} + \sum_{i = A,D}^{}{\sum_{j = A,D}^{}{\sum_{k = A,D}^{}\mathbf{g}_{\mathbf{\text{ijk}}}}} + \ldots + \mathbf{e}\]

where \(\mathbf{u}\) are the additive effects (breeding values), \(\mathbf{v}\) the dominance deviations, \(\mathbf{g}_{\mathbf{\text{ij}}}\) the second-order epistatic effects, \(\mathbf{g}_{\mathbf{\text{ijk}}}\) the third-order epistatic effects, and so on; \(\mathbf{e}\) is a residual vector. The second-order epistatic effects can be partitioned into additive-by-additive \((\mathbf{g}_{\mathbf{\text{AA}}})\), additive-by-dominant \((\mathbf{g}_{\mathbf{\text{AD}}})\) and dominant-by-dominant \((\mathbf{g}_{\mathbf{\text{DD}}})\). Third-order epistatic effects can be included in the model, but they are either negligible (W. G. Hill and Mäki-Tanila 2015) or too difficult to estimate. Note that this genomic model includes “genetic” effects.

For obtaining genetic variance component estimates comparable to pedigree-based variances, a fully orthogonal “statistical” model was proposed (L. Varona 2014; Z. G. Vitezica et al. 2017). We defined in the previous sections the breeding values of a set of individuals as \(\mathbf{u} = \mathbf{Z}\alpha\) and the dominance deviations as \(\mathbf{v} = \mathbf{Wd}\). From here on, we rename \(\mathbf{Z}\) and \(\mathbf{W}\) as \(\mathbf{H}_{a}\) and \(\mathbf{H}_{d}\) respectively. Thus, \(\mathbf{u} = \mathbf{H}_{a}\mathbf{\alpha}\) and \(\mathbf{v} = \mathbf{H}_{d}\mathbf{d}\). As seen before, the matrix \(\mathbf{H}_{a}\) has \(n\) rows (number of individuals) and \(m\) columns (number of markers) containing “additive” coefficients. This matrix can be written as

\[\mathbf{H}_{a} = \begin{pmatrix} \mathbf{h}_{a_{1}} \\ \vdots \\ \mathbf{h}_{a_{n}} \\ \end{pmatrix}\]

where \(\mathbf{h}_{a_{k}}\) is a row vector with \(m\) columns for the \(k\)-th individual. For individual 1, the vector \(\mathbf{h}_{a_{1}}\) equals \(\left( h_{a_{11}},\ldots,h_{a_{1m}} \right)\).

Álvarez-Castro and Carlborg (2007) proved that the coefficients of the incidence matrix for second-order epistatic effects between two loci can be computed as Kronecker products of the respective incidence matrices for single-locus effects. So, for interactions such as the additive-by-dominant interaction, the matrix \(\mathbf{H}_{\text{ad}}\) can be written using Kronecker products of each row of the preceding matrices as

\[\mathbf{H}_{\text{ad}} = \begin{pmatrix} \mathbf{h}_{a_{1}} \otimes \mathbf{h}_{d_{1}} \\ \mathbf{h}_{a_{2}} \otimes \mathbf{h}_{d_{2}} \\ \vdots \\ \mathbf{h}_{a_{n}} \otimes \mathbf{h}_{d_{n}} \\ \end{pmatrix}\]

For instance, for individual 1 the incidence row of additive-by-dominant epistatic effects is \(\mathbf{h}_{ad_{1}} = \mathbf{h}_{a_{1}} \otimes \mathbf{h}_{d_{1}}\). As an example, with 2 individuals and 3 loci,

\[\mathbf{H}_{a} = \begin{pmatrix} h_{a_{11}} & h_{a_{12}} & h_{a_{13}} \\ h_{a_{21}} & h_{a_{22}} & h_{a_{23}} \\ \end{pmatrix},\ \ \mathbf{H}_{d} = \begin{pmatrix} h_{d_{11}} & h_{d_{12}} & h_{d_{13}} \\ h_{d_{21}} & h_{d_{22}} & h_{d_{23}} \\ \end{pmatrix}\]

and

\[\mathbf{H}_{\text{ad}} = \begin{pmatrix} \begin{matrix} h_{a_{11}}h_{d_{11}} & h_{a_{11}}h_{d_{12}} & h_{a_{11}}h_{d_{13}} \\ \end{matrix} & \begin{matrix} h_{a_{12}}h_{d_{11}} & h_{a_{12}}h_{d_{12}} & h_{a_{12}}h_{d_{13}} \\ \end{matrix} & \begin{matrix} h_{a_{13}}h_{d_{11}} & h_{a_{13}}h_{d_{12}} & h_{a_{13}}h_{d_{13}} \\ \end{matrix} \\ \begin{matrix} h_{a_{21}}h_{d_{21}} & h_{a_{21}}h_{d_{22}} & h_{a_{21}}h_{d_{23}} \\ \end{matrix} & \begin{matrix} h_{a_{22}}h_{d_{21}} & h_{a_{22}}h_{d_{22}} & h_{a_{22}}h_{d_{23}} \\ \end{matrix} & \begin{matrix} h_{a_{23}}h_{d_{21}} & h_{a_{23}}h_{d_{22}} & h_{a_{23}}h_{d_{23}} \\ \end{matrix} \\ \end{pmatrix}\]

The matrix \(\mathbf{H}_{\text{ad}}\) has as many columns as marker interactions (here, 9) and as many rows as individuals. This matrix is of very large size (e.g. for a 50K SNP chip and 1000 individuals it contains \(1000 \times 50000^{2}\) elements). In addition, the cross-product \(\mathbf{H}_{\text{ad}}\mathbf{H}_{\text{ad}}^{'}\) (which we need to compute covariance matrices) is computationally expensive. Fortunately, an algebraic shortcut was found (Vitezica et al., 2017) that allows easy computation of \(\mathbf{H}_{\text{ad}}\mathbf{H}_{\text{ad}}^{'}\) and of the rest of the cross-products for epistatic matrices, even for third and higher orders.
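The shortcut can be illustrated numerically: the cross-product of the row-wise Kronecker product \(\mathbf{H}_{\text{ad}}\) equals the Hadamard product of the two small \(n \times n\) cross-products. Random toy matrices stand in for \(\mathbf{H}_{a}\) and \(\mathbf{H}_{d}\):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 5, 4
Ha = rng.normal(size=(n, m))   # toy stand-in for H_a
Hd = rng.normal(size=(n, m))   # toy stand-in for H_d

# row-wise Kronecker product: one row per individual, m*m interaction columns
Had = np.einsum('ij,ik->ijk', Ha, Hd).reshape(n, m*m)

direct = Had @ Had.T                    # expensive route: n x m^2 matrix
shortcut = (Ha @ Ha.T) * (Hd @ Hd.T)    # Hadamard product of n x n cross-products

assert np.allclose(direct, shortcut)
```

The identity holds element-wise because \((\mathbf{h}_{a_{k}} \otimes \mathbf{h}_{d_{k}})(\mathbf{h}_{a_{l}} \otimes \mathbf{h}_{d_{l}})' = (\mathbf{h}_{a_{k}}\mathbf{h}_{a_{l}}')(\mathbf{h}_{d_{k}}\mathbf{h}_{d_{l}}')\).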

The relationship matrices of epistatic genetic effects can be written as

\[\ Var\left( \mathbf{g}_{\text{AA}} \right) = \frac{\mathbf{H}_{\text{aa}}\mathbf{H'}_{\text{aa}}}{tr(\mathbf{H}_{\text{aa}}{\mathbf{H}^{\mathbf{'}}}_{\text{aa}})/n}\sigma_{g_{\text{AA}}}^{2} = \mathbf{G}_{\text{AA}}\sigma_{g_{\text{AA}}}^{2}\]

\[Var\left( \mathbf{g}_{\text{AD}} \right) = \frac{\mathbf{H}_{\text{ad}}\mathbf{H'}_{\text{ad}}}{tr(\mathbf{H}_{\text{ad}}{\mathbf{H}^{\mathbf{'}}}_{\text{ad}})/n}\sigma_{g_{\text{AD}}}^{2} = \mathbf{G}_{\text{AD}}\sigma_{g_{\text{AD}}}^{2}\]

\[Var\left( \mathbf{g}_{\text{DD}} \right) = \frac{\mathbf{H}_{\text{dd}}\mathbf{H'}_{\text{dd}}}{tr(\mathbf{H}_{\text{dd}}{\mathbf{H}^{\mathbf{'}}}_{\text{dd}})/n}\sigma_{g_{\text{DD}}}^{2} = \mathbf{G}_{\text{DD}}\sigma_{g_{\text{DD}}}^{2}\]

and with the algebraic shortcut as

\[Var\left( \mathbf{g}_{\text{AA}} \right) = \frac{\mathbf{G}_{A} \odot \mathbf{G}_{A}}{\text{tr}\left( \mathbf{G}_{A} \odot \mathbf{G}_{A} \right)/n}\sigma_{g_{\text{AA}}}^{2} = \mathbf{G}_{\text{AA}}\sigma_{g_{\text{AA}}}^{2}\]

\[Var\left( \mathbf{g}_{\text{AD}} \right) = \frac{\mathbf{G}_{A} \odot \mathbf{G}_{D}}{\text{tr}\left( \mathbf{G}_{A} \odot \mathbf{G}_{D} \right)/n}\sigma_{g_{\text{AD}}}^{2} = \mathbf{G}_{\text{AD}}\sigma_{g_{\text{AD}}}^{2}\]

\[Var\left( \mathbf{g}_{\text{DD}} \right) = \frac{\mathbf{G}_{D} \odot \mathbf{G}_{D}}{\text{tr}\left( \mathbf{G}_{D} \odot \mathbf{G}_{D} \right)/n}\sigma_{g_{\text{DD}}}^{2} = \mathbf{G}_{\text{DD}}\sigma_{g_{\text{DD}}}^{2}\]

using Hadamard products of additive and dominance genomic orthogonal relationship matrices. A standardization based on the trace of the relationship matrices is needed. The normalization factor based on the traces was already used by Xu (2013), but several authors ignore it (e.g. Muñoz et al. 2014). Here the reasoning is presented for pairwise interactions, but it extends to third and higher-order interactions (e.g., \(\mathbf{G}_{\text{AAD}} = \frac{\mathbf{G}_{A} \odot \mathbf{G}_{A} \odot \mathbf{G}_{D}}{\text{tr}\left( \mathbf{G}_{A} \odot \mathbf{G}_{A} \odot \mathbf{G}_{D} \right)/n}\)).
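A sketch of the trace standardization for pairwise and third-order terms; the matrices standing in for \(\mathbf{G}_{A}\) and \(\mathbf{G}_{D}\) are random toys, and the helper name `epistatic_G` is ours:

```python
import numpy as np

def epistatic_G(*mats):
    """Hadamard product of relationship matrices, normalized so that the
    mean diagonal (trace/n) equals 1 (sketch of the standardization above)."""
    K = np.ones_like(mats[0])
    for M in mats:
        K = K * M                      # element-wise (Hadamard) product
    n = K.shape[0]
    return K / (np.trace(K) / n)

rng = np.random.default_rng(11)
B = rng.normal(size=(6, 30))
C = rng.normal(size=(6, 30))
GA = B @ B.T / (np.trace(B @ B.T) / 6)   # toy additive relationship matrix
GD = C @ C.T / (np.trace(C @ C.T) / 6)   # toy dominance relationship matrix

G_AA  = epistatic_G(GA, GA)
G_AD  = epistatic_G(GA, GD)
G_DD  = epistatic_G(GD, GD)
G_AAD = epistatic_G(GA, GA, GD)          # third-order example
```

Each normalized matrix has mean diagonal 1, so the associated variance components stay on a comparable scale.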

Note that this approach only assumes linkage equilibrium. In outbred populations (such as animal populations), substantial LD (linkage disequilibrium) is present only between polymorphisms in tight linkage (W. G. Hill and Mäki-Tanila 2015).

Two other approaches exist in the literature to model epistatic interactions. First, a “biological” non-orthogonal model has been proposed by Martini et al. (2016), but it can only be used for prediction, not for the estimation of variance components. Second, the RKHS (Reproducing Kernel Hilbert Space) approach (Daniel Gianola, Fernando, and Stella 2006); however, most kernels consider similarities within loci and do not consider joint similarity across loci (Luis Varona et al. 2018).

13.7 Word of caution

It is quite easy to fit dominance, epistasis, etc. in a GBLUP context when data sets are not too large. However, there is very little information, and most estimates of variance components have high standard errors. Also, the estimates of dominance and epistatic deviations are not very accurate. Thus, the researcher should be cautious when interpreting the results and using them in practice.

14 Single Step GBLUP

The idea for ssGBLUP came from the fact that only a small portion of the animals in a given population are genotyped. Therefore, the best approach to avoid several steps is to combine pedigree and genomic relationships into a single matrix and use it as the covariance structure in the mixed model equations (MME). There are two derivations, and both are very similar.

14.1 SSGBLUP as improved relationships

Legarra et al. (2009) stated that genomic evaluations would be simpler if genomic relationships were available for all animals in the model. Their idea was to look at \(\mathbf{A}\) as an a priori relationship matrix and at \(\mathbf{G}\) as an observed one; however, \(\mathbf{G}\) is observed only for some individuals, which have \(\mathbf{A}_{22}\) as a priori relationships. Based on that, they showed that the genomic information could be extended to ungenotyped animals through the joint distribution of breeding values of ungenotyped (\(\mathbf{u}_{1}\)) and genotyped (\(\mathbf{u}_{2}\)) animals:

\[p(\mathbf{u}_{1},\mathbf{u}_{2}) =p\left( \mathbf{u}_{2} \right)p(\mathbf{u}_{1}|\mathbf{u}_{2})\]

\[p(\mathbf{u}_{2}) = \ N(\mathbf{0},\mathbf{G})\]

If we consider that

\[\text{var}\left( \mathbf{u} \right) = \mathbf{A}\sigma_{u}^{2}\]

In the following, we can just omit \(\sigma_{u}^{2}\) from the derivations, and partition \(\mathbf{A}\) as

\[\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \\ \end{bmatrix}\]

The conditional distribution of breeding values for ungenotyped and genotyped animals is

\[p(\mathbf{u}_{1}|\mathbf{u}_{2}) = \ N\left( \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2},\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} \right)\]

This can also be seen as

\[\mathbf{u}_{1}\mathbf{=}\mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2} + \boldsymbol{\varepsilon}\]

With \(Var\left( \mathbf{\varepsilon} \right) = \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21}\mathbf{.}\)

Because the animals with subscript 1 have no genotypes, the variance of \(\boldsymbol{\varepsilon}\) depends on their pedigree relationships with genotyped animals. The derivation assumes multivariate normality of \(\boldsymbol{\varepsilon}\), which holds because breeding values result from the sum of many gene effects.

Using standard rules, the variances and covariances are:

\[Var\left( \mathbf{u}_{1} \right) = var\left( \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2} + \mathbf{\varepsilon} \right){= Var(\mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2}) + Var(\mathbf{\varepsilon})\]

\[{= \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} + \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21}\]

Rearranging:

\[{= \mathbf{A}_{11} + \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21}\]

\[{= \mathbf{A}_{11} + \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{I}\mathbf{A}_{21}\]

\[{= \mathbf{A}_{11} + \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}{\mathbf{A}_{22}\mathbf{A}}_{22}^{- 1}\mathbf{A}_{21}\]

Therefore,

\[{Var\left( \mathbf{u}_{1} \right) = \mathbf{A}_{11} + \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\left( \mathbf{G} - \mathbf{A}_{22} \right)\mathbf{A}_{22}^{- 1}\mathbf{A}_{21}\]

\[Var\left( \mathbf{u}_{2} \right) =\mathbf{G}\]

\[{Cov\left( \mathbf{u}_{1},\mathbf{u}_{2} \right) = Cov\left( \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2},\mathbf{u}_{2} \right) = \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}Var\left( \mathbf{u}_{2} \right) = \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}}\]

Finally, the matrix that contains the joint relationships of genotyped and ungenotyped animals is given by (again, assuming for simplicity of presentation \(\sigma_{u}^{2} = 1\)):

\[ \begin{aligned} \mathbf{H} = & \begin{pmatrix} \text{var}\left( \mathbf{u}_{1} \right) & \text{cov}\left( \mathbf{u}_{1},\mathbf{u}_{2} \right) \\ \text{cov}\left( \mathbf{u}_{2},\mathbf{u}_{1} \right) & \text{var}\left( \mathbf{u}_{2} \right) \\ \end{pmatrix}\\ = & \begin{pmatrix} {\mathbf{A}_{11} + \mathbf{A}}_{12}\mathbf{A}_{22}^{- 1}\left( \mathbf{G} - \mathbf{A}_{22} \right)\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} & \ \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G} \\ \mathbf{G}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} & \mathbf{G} \\ \end{pmatrix} \\ = & \mathbf{A} + \begin{bmatrix} \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}{\left( \mathbf{G} - \mathbf{A}_{22} \right)\mathbf{A}}_{22}^{- 1}\mathbf{A}_{21} & \ \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\left( \mathbf{G} - \mathbf{A}_{22} \right) \\ \left( \mathbf{G} - \mathbf{A}_{22} \right)\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} & \mathbf{G} - \mathbf{A}_{22} \\ \end{bmatrix} \end{aligned}\]

Which can be simplified to:

\[\mathbf{H}\ = \mathbf{A} + \begin{bmatrix} \mathbf{A}_{12}\mathbf{A}_{22}^{- 1} & \ \mathbf{0} \\ \mathbf{0} & \mathbf{I} \\ \end{bmatrix}\begin{bmatrix} \mathbf{I} \\ \mathbf{I} \\ \end{bmatrix}\left\lbrack \mathbf{G} - \mathbf{A}_{22} \right\rbrack\begin{bmatrix} \mathbf{I} & \mathbf{I} \\ \end{bmatrix}\begin{bmatrix} \mathbf{A}_{22}^{- 1}\mathbf{A}_{21} & \ \mathbf{0} \\ \mathbf{0} & \mathbf{I} \\ \end{bmatrix}\]

We usually assume in these notes that \(\mathbf{u}_{2} = \mathbf{Za}\), which leads to VanRaden’s \(\mathbf{G}\) (P. M. VanRaden 2008). The derivation of (A. Legarra, Aguilar, and Misztal 2009) does not actually require \(\mathbf{G}\) to be VanRaden’s \(\mathbf{G}\) (it could potentially be something else), but because they use \(\mathbf{A}\) to model relationships, they assume an additive model. So, \(\mathbf{G}\) should be “additive”, which makes sense for VanRaden’s \(\mathbf{G}\) but also for similar matrices like \(\mathbf{G}_{\text{IBS}}\) or “corrected” \(\mathbf{G}_{\text{IBS}}\). One of the key assumptions of the method in (A. Legarra, Aguilar, and Misztal 2009) is that \(E\left( \mathbf{u}_{2} \right) = 0\) (genotyped animals have an expected breeding value of 0), which is not necessarily true if those animals are selected. We will see ways to deal with that later. Although \(\mathbf{H}\) is very complicated, \(\mathbf{H}^{-1}\) is quite simple (I. Aguilar et al. 2010; O. F. Christensen and Lund 2010):

\[\mathbf{H}^{- 1} = \mathbf{A}^{- 1} + \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \\ \end{bmatrix}\]
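This identity can be verified numerically on small toy matrices standing in for \(\mathbf{A}\) and \(\mathbf{G}\) (a sketch, assuming both are positive definite):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 3, 4                       # ungenotyped, genotyped animals
L = rng.normal(size=(n1+n2, n1+n2))
A = L @ L.T + np.eye(n1+n2)         # toy SPD stand-in for pedigree relationships
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A22 = A[n1:, n1:]
Lg = rng.normal(size=(n2, n2))
G = Lg @ Lg.T + np.eye(n2)          # toy SPD stand-in for genomic relationships

# build H blockwise from the formulas in the text
P = A12 @ np.linalg.inv(A22)        # A12 * A22^{-1}
H = np.block([[A11 + P @ (G - A22) @ P.T, P @ G],
              [G @ P.T,                   G]])

# claimed inverse: A^{-1} plus G^{-1} - A22^{-1} in the genotyped block
Hinv = np.linalg.inv(A)
Hinv[n1:, n1:] += np.linalg.inv(G) - np.linalg.inv(A22)

assert np.allclose(np.linalg.inv(H), Hinv)
```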

14.2 SSGBLUP as linear imputations

Christensen and Lund (2010) proposed another derivation. They started by inferring the genomic relationship matrix for all animals using inferred (imputed) genotypes for non-genotyped animals; we have seen that this can be obtained using Gengler’s method (N. Gengler, Mayeres, and Szydlowski 2007), modelling the gene content \(\mathbf{z}\) as a quantitative trait: \(\mathbf{z}\mathbf{=}\mathbf{1}\mu + \mathbf{\text{Wu}}\mathbf{+}\mathbf{e}\). If \(\mu\) (\(= 2p\) in the base population) is known, then we can linearly “impute” centered gene content for one marker as \({\widehat{\mathbf{z}}}_{1} = \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{z}_{2}\), which extends to multiple markers as \({\widehat{\mathbf{Z}}}_{1} = \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{Z}_{2}\).

This provides the “best guess” of the genotypes. We may then construct a “poor man’s” version of \(\mathbf{G}\) using \(\widehat{\mathbf{G}} = {\widehat{\mathbf{Z}}}_{1}{\widehat{\mathbf{Z}}}_{1}^{'}/\sum 2p_{j}q_{j}\). This matrix will be incorrect because when we impute, we get a guess, and the guess has an error. However, missing data theory states that we need the joint distribution of these “guessed” genotypes. Assuming multivariate normality of genotypes (an approximation, but a very good one when many genotypes are considered), the “best guess” is \(E\left( \mathbf{Z}_{1} \middle| \mathbf{Z}_{2} \right) = {\widehat{\mathbf{Z}}}_{1}\), and the conditional variance expressing the uncertainty about the “guess” is \(Var\left( \mathbf{Z}_{1} \middle| \mathbf{Z}_{2} \right) = \left( \mathbf{A}_{11}\mathbf{-}\mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} \right)\mathbf{V}\), where \(\mathbf{V}\) contains \(2p_{k}q_{k}\) on the diagonal. These two results can be combined to obtain the desired augmented genomic relationships. For instance, for the non-genotyped animals,

\[Var\left( \mathbf{u}_{1} \right) = \sigma_{u}^{2}\left( \frac{{\widehat{\mathbf{Z}}}_{1}{{\widehat{\mathbf{Z}}}^{\mathbf{'}}}_{1}}{2\Sigma p_{k}q_{k}} + \mathbf{A}_{11}\mathbf{-}\mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} \right),\]

which equals

\[Var\left( \mathbf{u}_{1} \right) = \sigma_{u}^{2}\left( \mathbf{A}_{11}\mathbf{-}\mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} + \ \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}{\mathbf{A}_{22}^{- 1}\mathbf{A}}_{21} \right)\]

Finally, the augmented covariance matrix is

\[Var\begin{pmatrix} \mathbf{u}_{1} \\ \mathbf{u}_{2} \\ \end{pmatrix} = \sigma_{u}^{2}\mathbf{H},\]

where

\[\mathbf{H} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{-}\mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} + \ \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G}{\mathbf{A}_{22}^{- 1}\mathbf{A}}_{21} & \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{G} \\ \mathbf{G}{\mathbf{A}_{22}^{- 1}\mathbf{A}}_{21} & \mathbf{G} \\ \end{pmatrix},\]

is the augmented genomic relationship matrix with inverse

\[\mathbf{H}^{\mathbf{- 1}} = \mathbf{A}^{\mathbf{- 1}}\mathbf{+}\begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{\mathbf{- 1}}\mathbf{-}\mathbf{A}_{22}^{- 1} \\ \end{pmatrix}\]

assuming that \(\mathbf{G}\) is invertible (this will be dealt with later). Therefore, by using an algebraic data augmentation of missing genotypes, Christensen and Lund (2010) derived a simple expression for an augmented genomic relationship matrix and its inverse, without the need to explicitly augment, or “guess”, all genotypes for all non-genotyped animals. The key hypothesis here is that the base population allele frequencies are known, which is not necessarily true.
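A minimal sketch of the linear imputation step, assuming base-population allele frequencies \(p\) are known (all matrices below are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(13)
n1, n2, m = 3, 5, 10
L = rng.normal(size=(n1+n2, n1+n2))
A = L @ L.T + np.eye(n1+n2)               # toy SPD stand-in for A
A12 = A[:n1, n1:]
A22 = A[n1:, n1:]

p = rng.uniform(0.2, 0.8, size=m)         # assumed base-population allele frequencies
Z2 = rng.integers(0, 3, size=(n2, m)) - 2*p   # centered observed genotypes

# linear "imputation" of centered gene content for non-genotyped animals
Z1hat = A12 @ np.linalg.inv(A22) @ Z2

# "poor man's" genomic relationships among non-genotyped animals
# (incorrect on its own: it ignores the imputation error term)
Ghat11 = Z1hat @ Z1hat.T / np.sum(2*p*(1-p))
```

In the full derivation, the conditional variance term \(\left( \mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} \right)\) is added to this cross-product, giving the upper-left block of \(\mathbf{H}\).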

14.3 ssGBLUP mixed model equations

Assuming the following animal model:

\[\mathbf{y} = \mathbf{Xb} + \mathbf{W}\mathbf{u} + \mathbf{e}\]

The MME for ssGBLUP become, for one trait:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{X}\sigma_{e}^{- 2} & \mathbf{X}^{\mathbf{'}}\mathbf{W}\sigma_{e}^{- 2} \\ \mathbf{W}^{\mathbf{'}}\mathbf{X}\sigma_{e}^{- 2} & \mathbf{W}^{\mathbf{'}}\mathbf{W}\sigma_{e}^{- 2}\mathbf{+}\mathbf{H}^{- 1}\sigma_{u}^{- 2}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{y}\sigma_{e}^{- 2} \\ \mathbf{W}^{\mathbf{'}}\mathbf{y}\sigma_{e}^{- 2} \\ \end{pmatrix}\]

And for multiple traits:

\[\begin{pmatrix} \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{X}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{W} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{X} & \mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{W} + \mathbf{H}^{- 1}\bigotimes\mathbf{G}_{0}^{- 1}\ \\ \end{pmatrix}\begin{pmatrix} \widehat{\mathbf{b}} \\ \widehat{\mathbf{u}} \\ \end{pmatrix} = \begin{pmatrix} \mathbf{X}^{'}\mathbf{R}^{- 1}\mathbf{y} \\ \mathbf{W}^{\mathbf{'}}\mathbf{R}^{- 1}\mathbf{y} \\ \end{pmatrix}\]

where \(\mathbf{G}_{0}\) is the matrix of genetic covariance across traits, and usually \(\mathbf{R}\mathbf{=}\mathbf{I}\bigotimes\mathbf{R}_{0}\), where \(\mathbf{R}_{0}\) is the matrix of residual covariances. The formulation is as general as pedigree-based BLUP.
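The Kronecker structure of the multiple-trait random-effect block can be illustrated with a minimal R sketch. The matrices below are made-up stand-ins (2 traits, 3 animals, traits nested within animals); in practice \(\mathbf{H}\) is large and only \(\mathbf{H}^{- 1}\) is formed:

```r
# Sketch of the multiple-trait random-effect block H^{-1} (x) G0^{-1}
# with made-up matrices: 2 traits, 3 animals, traits within animals.
G0 <- matrix(c(1.0, 0.3,
               0.3, 0.5), 2, 2)          # genetic (co)variances across traits
H  <- diag(3)                            # stand-in for the relationship matrix H
block <- kronecker(solve(H), solve(G0))  # 6 x 6 block of the MME
dim(block)
```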


14.4 Some properties of matrix H

Matrix \(\mathbf{H}\) is a full-rank (invertible) matrix, because it can be written as

\[\mathbf{H}\ = \mathbf{A} + \begin{bmatrix} \mathbf{A}_{12}\mathbf{A}_{22}^{- 1} & \ \mathbf{0} \\ \mathbf{0} & \mathbf{I} \\ \end{bmatrix}\begin{bmatrix} \mathbf{I} \\ \mathbf{I} \\ \end{bmatrix}\left\lbrack \mathbf{G} - \mathbf{A}_{22} \right\rbrack\begin{bmatrix} \mathbf{I} & \mathbf{I} \\ \end{bmatrix}\begin{bmatrix} \mathbf{A}_{22}^{- 1}\mathbf{A}_{21} & \ \mathbf{0} \\ \mathbf{0} & \mathbf{I} \\ \end{bmatrix}\]

which is a full-rank matrix. However, for \(\mathbf{H}\) to be positive definite (which is the requisite for using its inverse in MME), \(\mathbf{G} - \mathbf{A}_{22}\) needs to be positive definite. It usually is, perhaps after some adjustments for compatibility that will be described later.

The inverse

\[\mathbf{H}^{\mathbf{- 1}} = \mathbf{A}^{\mathbf{- 1}}\mathbf{+}\begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{\mathbf{- 1}}\mathbf{-}\mathbf{A}_{22}^{- 1} \\ \end{pmatrix}\]

is also full rank, but for it to be positive definite, it needs \(\mathbf{G}^{\mathbf{-}\mathbf{1}}\mathbf{-}\mathbf{A}_{22}^{- 1}\) to be positive definite. Again, if things are done properly, it usually is.

Construction of \(\mathbf{H}^{\mathbf{-}\mathbf{1}}\) is simple, because it follows four steps:

  1. Build \(\mathbf{A}^{\mathbf{- 1}}\) using Henderson’s rules

  2. Build \(\mathbf{G}\) and invert it

  3. Build \(\mathbf{A}_{22}\) and invert it

  4. Add \(\mathbf{G}^{\mathbf{-}\mathbf{1}}\mathbf{-}\mathbf{A}_{22}^{- 1}\) to the block of \(\mathbf{A}^{\mathbf{-}\mathbf{1}}\) corresponding to genotyped animals

The matrix \(\mathbf{A}_{22}\) is the relationship matrix of the genotyped individuals. This matrix can be constructed using the tabular method, but this is very costly for large data sets. A better option is to use either recursion (I. Aguilar and Misztal 2008) or the algorithm of Colleau (2002). Several strategies were described by Aguilar et al. (2011). Note also that \(\mathbf{A}_{22}^{- 1}\) is not the corresponding block of \(\mathbf{A}^{\mathbf{-}\mathbf{1}}\); in other words, it has to be constructed and inverted explicitly.
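The steps above can be sketched in R on a toy pedigree with made-up numbers (animals 1 and 2 are unrelated founders, animal 3 is their offspring and the only genotyped one; for clarity, inverses are taken by brute force rather than with Henderson's rules or recursions):

```r
# Toy construction of H^{-1}: pedigree 1, 2 (founders), 3 = 1 x 2.
A <- matrix(c(1.0, 0.0, 0.5,
              0.0, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3, byrow = TRUE)
geno <- 3                             # index of the genotyped animal
G    <- matrix(1.05)                  # made-up genomic self-relationship
Hinv <- solve(A)                      # step 1: A^{-1}
Hinv[geno, geno] <- Hinv[geno, geno] +
  solve(G) -                          # step 2: G^{-1}
  solve(A[geno, geno, drop = FALSE])  # step 3: A22^{-1}; step 4: add the block
```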

The diagonal in \(\mathbf{G}^{\mathbf{-}\mathbf{1}}\mathbf{-}\mathbf{A}_{22}^{- 1}\) is usually positive. This implies (roughly) that there is more information in \(\mathbf{G}\) than in \(\mathbf{A}_{22}\), because \(\mathbf{G}\) captures realized relationships.

Matrix \(\mathbf{H}^{- 1}\) is rather sparse.

When the number of animals genotyped is very large (larger than the number of markers), matrix \(\mathbf{G}\) gets rather big. For this reason, there are other formulations of ssGBLUP that will be presented later.

Matrix \(\mathbf{H}\) above can be seen as a modification of regular pedigree relationships to accommodate genomic relationships. For instance, two seemingly unrelated individuals will appear as related in H if their descendants are related in G. Accordingly, two descendants of individuals that are related in G will be related in H, even if the pedigree disagrees. Indeed, it has been suggested to use H in mating programs to avoid inbreeding (C. Sun et al. 2013).

Contrary to common intuition from BLUP or GBLUP, genotyped animals without phenotype or descendants should not be eliminated from matrix \(\mathbf{H}\) unless both their parents are genotyped. The reason is that (unless both parents are genotyped) these animals potentially modify pedigree relationships among other animals, notably their parents. For instance, imagine two half sibs, offspring of one sire mated to two non-genotyped, unrelated cows. If these two half sibs are virtually identical, \(\mathbf{H}\) will include this information and the cows will be made related (even identical) in \(\mathbf{H}\).
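This half-sib example can be verified numerically with a minimal sketch, using the block expression of \(\mathbf{H}\) above and a made-up \(\mathbf{G}\) in which the two half sibs are nearly identical:

```r
# Half-sib example: sire 1, unrelated non-genotyped dams 2 and 3,
# genotyped half sibs 4 (= 1 x 2) and 5 (= 1 x 3).
A <- matrix(c(1.0, 0.0, 0.0, 0.50, 0.50,
              0.0, 1.0, 0.0, 0.50, 0.00,
              0.0, 0.0, 1.0, 0.00, 0.50,
              0.5, 0.5, 0.0, 1.00, 0.25,
              0.5, 0.0, 0.5, 0.25, 1.00), 5, 5, byrow = TRUE)
ng <- 1:3; g <- 4:5                       # non-genotyped / genotyped indices
A11 <- A[ng, ng]; A12 <- A[ng, g]; A21 <- A[g, ng]; A22 <- A[g, g]
G <- matrix(c(1.0, 0.9,
              0.9, 1.0), 2, 2)            # made-up G: nearly identical half sibs
P <- A12 %*% solve(A22)
H <- rbind(cbind(A11 - P %*% A21 + P %*% G %*% t(P), P %*% G),
           cbind(G %*% t(P), G))
H[2, 3]  # > 0: the dams become related in H, although A[2, 3] = 0
```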

14.4.1 Inbreeding in H

The diagonal elements of \(\mathbf{A}\) contain inbreeding expressed as \(F_{i} = A_{\text{ii}} - 1\). The diagonal elements of \(\mathbf{G}\) contain genomic inbreeding expressed as \(F_{i} = G_{\text{ii}} - 1\). Inbreeding is useful to manage genetic variability, and also to compute model-based reliabilities as \(rel_{i} = 1 - \frac{\text{PEV}}{\left( 1 + F_{i} \right)\sigma_{u}^{2}}\). Because \(\mathbf{H}\) uses all information, the best estimate of inbreeding combining pedigree and genomic information is actually \(F_{i} = H_{\text{ii}} - 1\). Note that we have efficient methods to obtain \(\mathbf{H}^{- 1}\), but we do not have efficient methods to obtain \(\mathbf{H}\) or its diagonal. Xiang et al. (2017) obtained the diagonal of \(\mathbf{H}\) by computing the sparse inverse of \(\mathbf{H}^{- 1}\), as is programmed in YAMS. Still, we do not know yet how to efficiently obtain “H-inbreeding”. Recently, Colleau et al. (2017) proposed a method that allows computing overall statistics of \(\mathbf{H}\), such as total relationship. The method is quite involved numerically, but so far it is the only existing option.

14.5 Mixing G and A: blending and compatibility of pedigree and genomic relationships.

This is a very important chapter, because many people (including famous researchers) confuse “compatibility”, which tries to put \(\mathbf{G}\) and \(\mathbf{A}\) on the same scale, with “blending”, which is basically a technique used to assign part of the genetic variance to pedigree rather than markers, and, at the same time, used to obtain an invertible \(\mathbf{G}\).

14.5.1 Blending

14.5.1.1 Blending to include the residual polygenic effect

In previous chapter for GBLUP, we saw that we can model the total genetic effect as based, partly, on pedigree, and partly on genomic relationships. Let us decompose the breeding values of all individuals in a part due to markers and a residual part due to pedigree, \(\mathbf{u}\mathbf{=}\mathbf{u}_{m}\mathbf{+}\mathbf{u}_{p}\) with respective variances \(\sigma_{u}^{2} = \sigma_{u,m}^{2} + \sigma_{u,p}^{2}\). The “marker-based” part will have a relationship matrix H in Single Step, whereas the “pedigree-based” part will have a relationship matrix A.

It follows that \(Var\left( \mathbf{u} \right) = \left( \left( 1 - \alpha \right)\mathbf{H}\mathbf{+}\alpha\mathbf{A} \right)\sigma_{u}^{2}\) where \(\alpha = \sigma_{u,p}^{2}/\sigma_{u}^{2}\). In practice, in SSGBLUP, it is easier to create a modified genomic relationship matrix \(\mathbf{G}_{w}\) (\(\mathbf{G}\) in (I. Aguilar et al. 2010); \(\mathbf{G}_{w}\) in (P. M. VanRaden 2008; O. F. Christensen 2012)) as \(\mathbf{G}_{w} = \left( 1 - \alpha \right)\mathbf{G}\mathbf{+}\alpha\mathbf{A}_{22}\).

This is known as “blending”. In practice, the value of \(\alpha\) is low (reported values range between 0.05 and 0.7) and has mostly negligible effects on predictions. It has been claimed that blending reduces bias of predictions, and this results in different optimal \(\alpha\) coefficients for different traits and species. It seems strange to accept that markers would describe one trait up to 95% of its variance and, for the same animals, another trait only up to 70%. A more coherent approach is to estimate the two components in \(\sigma_{u}^{2} = \sigma_{u,m}^{2} + \sigma_{u,p}^{2}\) by REML (O. F. Christensen and Lund 2010), fitting explicitly two separate random effects, \(\mathbf{u}_{m}\) and \(\mathbf{u}_{p}\), with respective covariance matrices \(\mathbf{H}\sigma_{u,m}^{2}\) and \(\mathbf{A}\sigma_{u,p}^{2}\).

14.5.1.2 Blending to make G invertible

Matrix G is often not invertible, and therefore we can come up with “tricks” to make it invertible. The simplest trick is to add a small constant, say 0.01, to the diagonal of G:

\[\mathbf{G}_{w} \leftarrow \mathbf{G} + 0.01\mathbf{I}\]

which gives \(\mathbf{G}_{w}\) nearly identical to \(\mathbf{G}\): \(\mathbf{G}_{w} \approx \mathbf{G}\) and therefore \(\mathbf{G}_{w} \approx \frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{\sum\left( 2p_{j}q_{j} \right)}\).

Alternatively, we can use the “blending” with the relationship matrix as above

\[\mathbf{G}_{w} \leftarrow \left( 1 - \alpha \right)\mathbf{G +}\alpha\mathbf{A}_{22}\]

Here, \(\mathbf{G}_{w}\) is “less” close to the original G and to \(\frac{\mathbf{Z}\mathbf{Z}^{\mathbf{'}}}{\sum\left( 2p_{j}q_{j} \right)}\). This has repercussions for the backsolving of SNP effects. For this reason, “blending” with the relationship matrix is becoming more cumbersome for some computational strategies such as APY or SNP-based models.
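Both tricks are one-liners. A minimal sketch with a hypothetical 2 x 2 example:

```r
# Made-up genomic and pedigree relationships for 2 genotyped animals.
G   <- matrix(c(1.02, 0.48,
                0.48, 0.98), 2, 2)
A22 <- matrix(c(1.0, 0.5,
                0.5, 1.0), 2, 2)
Gw1 <- G + 0.01 * diag(2)              # trick 1: bump the diagonal
alpha <- 0.05
Gw2 <- (1 - alpha) * G + alpha * A22   # trick 2: blend with A22
```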

14.6 Compatibility of G and A

14.6.1 Fitting G to A

Based on the way \(\mathbf{H}\) is constructed, the central element is \(\mathbf{G} - \mathbf{A}_{22}\), which implies both matrices should be compatible (Andres Legarra et al. 2014). VanRaden (2008) stated that G had to be constructed with base allele frequencies. However, genomic relationships can be biased if G is constructed based on allele frequencies other than the ones calculated from the base population (P. M. VanRaden 2008). Allele frequencies from the base population are not known because of the recent recording of pedigrees (i.e., the base population per se is unknown). In some cases, such as dairy cattle, base allele frequencies can be inferred; in other cases, such as pigs or sheep, they cannot. In the typical case, the allele frequencies are observed a few generations after the start of pedigree recording. Allele frequencies \(p\) will tend to fixation towards the closest extreme (1 or 0) if they are neutral, and towards the favorable allele if they have effects.

Most commonly, allele frequencies used to construct G are based on the observed population. This brings two problems. The first one is that the machinery of “linear imputation” of Christensen and Lund (2010) fails: the expression \({\widehat{\mathbf{Z}}}_{1} = \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{Z}_{2}\) assumes that the expectation of gene content is 0, which does not hold when the allele frequencies are not those of the base population.

The second problem is that we assumed in the development that the expectation of breeding values for genotyped animals is 0. If the population is under selection, recent animals should have higher genetic values than the base generation. Thus, the assumption \(\mathbf{u}_{2}\sim N(\mathbf{0},\mathbf{G}\sigma_{u}^{2})\) does not hold. A more sensible approach is to posit a mean for these animals: \(\mathbf{u}_{2}\sim N(\mathbf{1}\mu,\mathbf{G}\sigma_{u}^{2})\) (Vitezica et al. 2011). In the chapter about genomic relationships, we have seen that if \(\mu\) is a random effect, this leads to a genomic relationship matrix \(\mathbf{G}^{\mathbf{*}}\mathbf{=}\mathbf{G} + \mathbf{1}\mathbf{1}^{\mathbf{'}}a\) where \(a = {\overline{\mathbf{A}}}_{22}\mathbf{-}\overline{\mathbf{G}}\); use of \(a\) leads to \(\mathbf{u}_{2}\sim N(\mathbf{0},\mathbf{G}^{\mathbf{*}}\sigma_{u}^{2})\) with mean 0. Equivalently, Vitezica et al. (2011) show that a model with an explicit estimate of \(\mu\) leads to the same solution. The idea has been considered also with \(\mu\) fit as a fixed effect (Hsu, Garrick, and Fernando 2017).

In addition, and as shown in the chapter of genomic relationships, there is a decrease in the genetic variance. This leads to very similar adjustments

\[\mathbf{G}^{*} = a + b\mathbf{G}\]

with \(a\) and \(b\) inferred from a system of 2 equations:

\[\frac{\text{tr}\left( \mathbf{G} \right)}{m}b + a = \frac{\text{tr}\left( \mathbf{A}_{22} \right)}{m}\]

\[a + b\overline{\mathbf{G}}\mathbf{=}{\overline{\mathbf{A}}}_{22}\]

These adjustments account for genotyped animals being more related through \(\mathbf{A}_{22}\) than \(\mathbf{G}\) is able to reflect.
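Given the mean diagonal and overall mean of both matrices, \(a\) and \(b\) come from solving a 2 x 2 linear system. A minimal sketch with hypothetical summary values:

```r
# Hypothetical summary statistics of G and A22.
mean_diag_G   <- 1.00; mean_all_G   <- 0.20   # tr(G)/m and mean(G)
mean_diag_A22 <- 1.04; mean_all_A22 <- 0.30   # tr(A22)/m and mean(A22)
# Solve: b*mean_diag_G + a = mean_diag_A22 ; b*mean_all_G + a = mean_all_A22
sol <- solve(matrix(c(mean_diag_G, 1,
                      mean_all_G,  1), 2, 2, byrow = TRUE),
             c(mean_diag_A22, mean_all_A22))
b <- sol[1]; a <- sol[2]
```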

14.6.2 Fitting A to G

A second class of methods is also detailed in the same chapter, and leads to modifying \(\mathbf{A}_{22}\) to resemble \(\mathbf{G}\), rather than the opposite. Christensen (2012) argued that using any allele frequency is subject to uncertainty, and after algebraic integration of allele frequencies he devised a new pedigree relationship matrix, \(\mathbf{A}\left( \gamma \right)\), whose founders have relationship matrix \(\mathbf{A}_{\text{base}} = \gamma\mathbf{1}\mathbf{1}^{\mathbf{'}} + \mathbf{I}(1 - \gamma/2)\). The parameter \(s\), used in \(\mathbf{G}\mathbf{=}\mathbf{Z}\mathbf{Z}^{\mathbf{'}}/s\), can be understood as the counterpart of \(2\Sigma p_{i}q_{i}\) (heterozygosity of the markers) in the base generation.

Further developments by Garcia-Baccino et al. (2017) showed that the unknown \(s\) reduces to \(s = 2\sum{0.5}^{2} = m/2\), simply the number of markers divided by 2, and \(\gamma = 8\text{var}\left( p_{\text{base}} \right)\), where \(p_{\text{base}}\) are the (say, 50K) base allele frequencies. It may be argued that we still need to estimate \(p_{\text{base}}\); true, but using one inferred parameter \(\gamma\) to modify \(\mathbf{A}\), instead of using 50,000 inferred parameters in \(p_{\text{base}}\) to construct \(\mathbf{G}\), seems a safer strategy. Also, \(\gamma\) needs to be estimated only once, and not at each run of SSGBLUP when new genotypes are available. Both papers present methods to estimate \(\gamma\), but the simplest strategy is to estimate \(p_{\text{base}}\) using Gengler’s method and then compute \(\widehat{\gamma} = 8\text{var}\left( {\widehat{p}}_{\text{base}} \right)\). The method has interesting connections with Wright’s \(F_{\text{st}}\) theory (Garcia-Baccino et al. 2017) and with genetic distances across populations.
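A one-line sketch of this strategy, with made-up base allele frequencies standing in for the output of Gengler's method:

```r
# Made-up base allele frequencies for m = 1000 markers.
set.seed(1)
p_base <- runif(1000, 0.05, 0.95)
gamma_hat <- 8 * var(p_base)   # gamma = 8 var(p_base)
s_hat <- length(p_base) / 2    # s = m/2
```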

14.6.3 Unknown Parent Groups

Imagine that you are in Europe in 1975, and there is a massive introduction of selected US Holstein bulls into the less-selected European “Friesian” population. US data are not available, but you want your model to account for the fact that some groups of “parents” are really different from others. What you do is assign a “pseudo-parent” at the top of the US pedigree, and another “pseudo-parent” at the top of the European pedigree. These pseudo-parents are not animals per se; they are conceived as infinite pools of animals from which descendants are drawn; their descendants are neither inbred nor related.

This is, presented as a caricature, the origin of Genetic Groups or Unknown Parent Groups (UPGs hereinafter). For more details, go to regular texts on genetic evaluation (e.g. (Mrode and Thompson 2005)) and to the classic papers of Quaas (1988) and Thompson (1979). Application of UPGs goes through a special matrix \(\mathbf{A}^{*}\) that plays the role of the “regular” \(\mathbf{A}^{-1}\) in BLUP. This matrix is:

\[\mathbf{A}^{*} = \begin{pmatrix} \mathbf{Q'A^{- 1}Q} & \mathbf{- Q'A^{- 1}} \\ \mathbf{-A}^{- 1}\mathbf{Q} & \mathbf{A}^{- 1} \\ \end{pmatrix}\]

and includes UPG in the upper corner. This matrix has no inverse and therefore the “relationship” matrix with groups does not exist (this is indeed awkward).

Unknown Parent Groups are used extensively to model:

  1. Missing parentage, as in sheep (the sire is often unknown). Genetic Groups are often defined by year of birth to model genetic progress.

  2. Importations, or introduction of foreign material (as in pig companies). Genetic Groups are often defined by country of origin.

  3. Crosses (e.g. Angus x Gelbvieh). Genetic Groups are often defined by breed.

The key bit for what we want in these notes is to realize that, in the theory of Unknown Parent Groups,

\[p\left( \mathbf{u}_{2} \right) = N\left( \mathbf{Qg},\mathbf{A}\sigma_{u}^{2} \right)\]

with \(\mathbf{g}\) the “breeding value” of the unknown parent group, \(\mathbf{Q}\) containing fractions of origin, for instance an animal could have 10% of its genes from “Lacaune France 2002”, 15% of its genes from “Lacaune 2000”, 25% from “Lacaune 2004”, and 50% of its genes from “Lacaune 1996”.

Our problem now is that, when doing genomic predictions, we assumed

\[p\left( \mathbf{u}_{2} \right) = N\left( \mathbf{0,G}\sigma_{u}^{2} \right)\]

Then, why do we need UPGs if we have G, and G is replacing the pedigree information? How can we reconcile these two definitions of \(\mathbf{u}_{2}\)? The developments that lead to SSGBLUP fail, because we assumed that

\[p(\mathbf{u}_{1}|\mathbf{u}_{2}) = \ N\left( \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{u}_{2},\left( \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{- 1}\mathbf{A}_{21} \right)\sigma_{u}^{2} \right)\]

But this is no longer true in presence of Unknown Parent Groups (in part because of the fact that \(\mathbf{A}^{*}\) cannot be inverted). Some options were reviewed by Misztal et al. (2013). The idea of metafounders was published by Legarra et al. (2015). We discuss them quickly here. This is still an open topic for research.

14.6.3.1 Truncate pedigree and data

The simplest option is to remove old data, as in (D. Lourenco et al. 2014). If your UPGs model genetic trend but you are only interested in recent animals, you can remove “old” data, trace back your pedigree from your data by 3 generations, and not use UPGs at all. This is simple yet efficient. In addition, in this case A and G match almost automatically, because the base generation of the (truncated) pedigree is very close to the genotyped animals.

14.6.3.2 Approximate UPGs

The default option in blupf90, when there are UPGs, is to build a matrix \(H^{*}\) as follows:

\[\mathbf{H}^{*} = \mathbf{A}^{*} + \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \\ \end{pmatrix}\]

where \(\mathbf{A}_{22}\) is constructed as if UPGs did not exist, which is an approximation. This usually works well unless many animals (e.g. cows or sheep) have some unknown parent. In that case, there are three other solutions.

14.6.3.3 Fitting UPG as covariates

This is simple yet cumbersome. The model fit for genomic evaluation does not use \(\mathbf{A}^{*}\), but it fits UPG as covariates:

\[\mathbf{y = Xb + Qg + Zu + e}\]

with \(\mathbf{Q}\) a matrix of covariates, and using the “regular” \(\mathbf{H}^{- 1} = \mathbf{A}^{- 1} + \begin{pmatrix} 0 & 0 \\ 0 & \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \\ \end{pmatrix}\) for \(\mathbf{u}\). The final estimate is \({\widehat{\mathbf{u}}}^{\mathbf{*}}\mathbf{=}\widehat{\mathbf{u}}\mathbf{+}\mathbf{Q}\widehat{\mathbf{g}}\).

This model is not quite right. If G “contains” all needed information, for genotyped animals the group effect is counted twice: in Q and in G. Taking it to the extreme, in a GBLUP context it would make no sense to fit such covariates.

14.6.3.4 Fitting “exact UPGs”

This is equivalent to the previous solution, but the \(\mathbf{\text{Qg}}\) part is embedded within \(\mathbf{H}^{*}\):

\[\mathbf{H}^{*} = \mathbf{A}^{*} + \begin{pmatrix} \mathbf{Q}_{2}^{'}\left( \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \right)\mathbf{Q}_{2} & \mathbf{Q}_{2}^{'}\left( \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \right) \\ \left( \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \right)\mathbf{Q}_{2} & \mathbf{G}^{- 1} - \mathbf{A}_{22}^{- 1} \\ \end{pmatrix}\]

14.6.3.5 Metafounders

Legarra et al. (2015) suggested a different point of view. First, to apply Christensen (2012) theory to construct G as \(\mathbf{G}_{05}\), “pretending” that all allele frequencies are 0.5. Second, to substitute UPGs by pseudo-animals called metafounders, that have unusual relationship coefficients called \(\mathbf{\Gamma}\). For instance, an example of these coefficients is \(\mathbf{\Gamma} = \begin{pmatrix} 0.7 & 0.4 \\ 0.4 & 0.5 \\ \end{pmatrix}\). These coefficients need to “match” observed relationships in \(\mathbf{G}_{05}\); for instance, if Landrace and Yorkshire have an average relationship of \(0.4\) in \(\mathbf{G}_{05}\), then \(\gamma_{\text{Landrace},\text{Yorkshire}}\) should be 0.4. If Yorkshire animals that are unrelated based on pedigree have an average relationship of \(0.7\) in \(\mathbf{G}_{05}\), then \(\gamma_{\text{Yorkshire},\text{Yorkshire}} = 0.7\). Based on this, we create \(\mathbf{A}^{\Gamma}\) and then we use

\[\mathbf{H}^{\Gamma - 1} = \mathbf{A}^{\Gamma - 1} + \begin{pmatrix} 0 & 0 \\ 0 & \mathbf{G}^{- 1} - \mathbf{A}_{22}^{\Gamma - 1} \\ \end{pmatrix}\]

We achieve two things here: (1) modelling different means and (2) automatic compatibility between A and G. This strategy was used by Xiang et al. (2017).
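A hypothetical sketch of how \(\mathbf{\Gamma}\) coefficients could be read off average blocks of \(\mathbf{G}_{05}\) (made-up matrix for 2 Landrace and 2 Yorkshire animals; in practice one uses pairs unrelated in the pedigree and proper estimation methods):

```r
# Made-up G05 (all allele frequencies set to 0.5) for 4 genotyped animals.
breed <- c("L", "L", "Y", "Y")        # 2 Landrace, 2 Yorkshire
G05 <- matrix(c(0.70, 0.68, 0.41, 0.39,
                0.68, 0.72, 0.40, 0.40,
                0.41, 0.40, 0.69, 0.71,
                0.39, 0.40, 0.71, 0.70), 4, 4, byrow = TRUE)
gamma_LY <- mean(G05[breed == "L", breed == "Y"])  # across-breed average
GYY <- G05[breed == "Y", breed == "Y"]
gamma_YY <- mean(GYY[upper.tri(GYY)])              # within-Yorkshire average
```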

15 Use of method LR to assess potential bias due to design of cross-validation analysis.

Andres Legarra, INRAE, 28 Oct 2021.

This project has received funding from the European Union’s Horizon 2020 Research & Innovation programme under grant agreement N°772787 - SMARTER.

15.1 Introduction

In genomic selection, use of early genomic proofs can lead to suboptimal selection decisions if there is bias (see below for description of bias). In sheep and goats, and in particular for traits expressed in females, there is a lack of good tools to evaluate the presence or absence of bias, and the methods to evaluate accuracy of genomic selection are suboptimal.

Why do we care about bias? Selection theory establishes that selection is optimal if each candidate for selection is compared fairly with the others. This means that, across individuals, the mean of the Estimated Breeding Values (EBV, \(\hat{u}\)) of the selected candidates equals the expectation of their (true) Breeding Values (BV, \(u\)). When animals are selected, this is true under two conditions: \(\bar{u} = \bar{\hat{u}}\) and \(cov(u,\hat{u})=var(\hat{u})\), where the means and the covariances apply across the animals selected in an operation (i.e. at the time of selecting young male lambs). The property \(cov(u,\hat{u})=var(\hat{u})\) is needed because if the distribution of \(\hat{u}\) of e.g. young animals is too spread out (or not spread out enough), we will select too many (or too few) young animals. Note that at this point these properties are not statistical, and therefore they are neither “frequentist” nor “Bayesian”. 6

These properties can be formalized as

  1. equality of estimated and true means:

\[\mathbf{1'\hat{u}} = \mathbf{1'u} \] or equivalently \(\frac{1}{n}\sum{\hat{u}_i}=\frac{1}{n}\sum{{u}_i}\) or still \(\bar{\hat{u}} = \bar{u}\), and

  2. slope of true on estimated equal to 1

\[\frac{1}{n}\sum{\left(\hat{u}_i-\bar{\hat{u}}\right)^2}=\frac{1}{n}\sum{\left[\left(\hat{u}_i-\bar{\hat{u}}\right)\left(u_i-\bar{u}\right)\right]} \]

or equivalently \(cov(u,\hat{u})=var(\hat{u})\).

Henderson (Charles R. Henderson (1975), C. R. Henderson (1982)) established that the two properties above hold, even if there is selection, on expectation for one animal across repeated conceptual sampling of its \((u,\hat{u})\). Then Andres Legarra and Reverter (2018) extended the proof to sets of EBVs from groups of animals, so that the two properties hold on expectation for many animals across repeated conceptual sampling. By the Law of Large Numbers, when the number of animals is large, a number converges to its expectation. This means that, for a large number of animals, \(\bar{\hat{u}} = \bar{u}\) must hold empirically.

So the theory says that, without invoking some esoteric statistical framework, genetic evaluations should be unbiased. But how can we check this? We don’t have \(u\), only \(\hat{u}\). In dairy cattle, they compare predictions vs. progeny proofs (or Daughter Yield Deviations) but in other species the number of offspring of each animal is small.

In addition, we’re interested in finding out the accuracy of genomic prediction, i.e. \(r(u,\hat{u})\). Again, it is difficult to obtain this number in small ruminant cases.

15.2 Bias due to using pre-corrected data or De-Regressed Proofs (DRP)

The following is extracted from Andres Legarra and Reverter (2018). Often, precorrected phenotypes \(y^{*}\) or deregressed proofs have been used, comparing predictions \(\hat{y}\) with (precorrected) observations \(y^*\) (this method is sometimes called “predictability”). The estimator of accuracy is e.g. \(r \approx cor(y^{*},\hat{y})/h\) for \(h^2\) the heritability (Legarra et al. 2008). But this ignores that precorrection generates a covariance structure in \(y^{*}\), is very sensitive to low values of \(h^2\), and also ignores that the animals used in these studies may be preselected (the case, for instance, of elite males). This leads to paradoxes.

15.2.1 Bias due to ignoring the effect of selection on genetic variance

It also ignores that candidates to selection have reduced genetic variance (Bijma (2012)). For instance, for prospective AI rams in dairy sheep, because they’re highly selected, their genetic variance is less than the “normal” genetic variance 7.

Consider, for instance, a study on growth in meat sheep, using selected rams in a performance recording station. These rams are selected based on parent average, so their genetic variance is, say, \(k=80\%\) of the populational one, and \(h^2=0.3\). Through cross-validation we obtain \(cor(y^{*},\hat{y})=0.4\) and we conclude that \(r \approx cor(y^{*},\hat{y})/h = 0.73\). However, this is incorrect because these animals were selected, so that in these rams the heritability is actually \(h^{2*} = \frac{kh^2}{1-(1-k)h^2} \approx 0.26\). Plugging this into our equation, \(r \approx cor(y^{*},\hat{y})/h^{*} = 0.78\), quite a bit higher.
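The worked numbers of this paragraph can be checked in R:

```r
# Numbers from the text: variance retained after selection, h2, and cor(y*, yhat).
k  <- 0.8; h2 <- 0.3; r_obs <- 0.4
h2_sel <- k * h2 / (1 - (1 - k) * h2)   # heritability within the selected rams, ~0.26
acc_naive <- r_obs / sqrt(h2)           # ignoring selection, ~0.73
acc_sel   <- r_obs / sqrt(h2_sel)       # accounting for selection (text rounds to 0.78)
```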

15.2.2 Bias due to pre-correction by fixed effects

There is a second, non-negligible source of bias. We use \(y^{*}\) (precorrected data) as if it were “exact”. This leads to overestimation of accuracies. In Andres Legarra and Reverter (2018) we worked out that, for a balanced design with \(n_i\) records per contemporary group, the relative overestimation of accuracy is of order \(\frac{1}{n_i}\).

15.3 Method LR to the rescue

For all these reasons we want better methods to assess biases and accuracies.

Andres Legarra and Reverter (2018), with further proofs by Bermann et al. (2021), extended the machinery developed by Charles R. Henderson (1975) and Reverter et al. (1994) to infer biases and accuracies by splitting the data set. They defined partial (p) and whole (w) data sets, such that the partial data set contains all information until a given date, and the whole data set contains all information available to the analyst until a later date (not necessarily now). The procedure, called LR from Linear Regression 8, is described next.

15.3.1 LR in a nutshell

You have complete (whole) records, pedigree and (perhaps) markers. Consider a cut-off date. Records before this date make up the partial data set, \(\mathbf{y}_p\), whereas all records make up the whole data set, \(\mathbf{y}_w\). Then you run two genetic evaluations, with either the partial data or the whole data, keeping the entire pedigree and markers in both. In this manner, you have EBVs for all animals in both cases, \(\hat{u}_p\) and \(\hat{u}_w\) respectively.

Then you compare the EBVs of animals in the partial and whole evaluations. You don’t include all animals; you consider contemporary animals with similar information, in which you have an interest (for instance, male candidates for selection). We call these focal animals or focal groups. See below for examples.

The comparison is very simple and just consists of a series of statistics that can be easily computed. We propose several criteria. These can be found in F. L. Macedo et al. (2020), which is the most up-to-date source. Note that in the following, whenever we write something like \(cov(\hat{\mathbf{u}}_p,\hat{\mathbf u}_w)\) we mean a \(scalar\) (the “observed” covariance) and not a matrix (which is the sampling or prior distribution of the vector).

15.3.1.1 Bias

This is measured using \(\hat{\Delta}_p=\bar{\hat{\mathbf{u}}}_p - \bar{\hat{\mathbf{u}}}_w\). The expectation is 0 (no bias). A positive value means that animals with partial information are overevaluated.

15.3.1.2 Slope

Also called over/under-dispersion. This is measured using \(\hat{b}_p=\frac{cov(\hat{\mathbf{u}}_p,\hat{\mathbf u}_w)}{var(\hat{\mathbf u}_p)}\) or, equivalently, computing the slope \(b_1\) of the linear regression of “whole on partial”, \(\hat{u}_w \sim b_0 +b_1 \hat{u}_p+\epsilon\). The expectation is 1 (neither over- nor under-dispersion); values lower than 1 mean that selected candidates are overestimated. This is the kind of bias commonly reported in dairy cattle studies.

A very small example with 5 individuals follows:

# these are actually 5 "proven" bulls 
EBV2018=c(999,849,831,953,764) 
EBV2019=c(973,833,904,963,807) 
Delta_p=mean(EBV2018)-mean(EBV2019) # -16.8 
b_p=cov(EBV2019,EBV2018)/var(EBV2018) #0.71
aa=lm(EBV2019~EBV2018) 
b_p=aa$coefficients[2] # 0.71

15.3.1.3 Accuracies

There are two estimators of relative accuracies and two estimators of absolute accuracies.

The first one, \(\hat{\rho}_{wp}=cor(\hat{\mathbf u}_p,\hat{\mathbf u}_w)\), estimates a ratio of accuracies and not the absolute accuracy. For instance, values close to 1 indicate that the “partial” evaluation was “as accurate” as the “whole” evaluation, but both evaluations could still be not very accurate.

A byproduct of \(\hat{\rho}_{wp}\) is an estimator of the relative increase in accuracy. In effect, \(\frac{1}{\hat{\rho}_{wp}}-1\) has expected value \(\frac{acc_w-acc_p}{acc_p}\), which is the relative increase in accuracy from partial to whole. For instance, boars can be evaluated for carcass traits before or after some full sibs have been slaughtered, and \(\frac{1}{\hat{\rho}_{wp}}-1\) gives the relative increase in accuracy.

Note that, in fact, the statistic \(\hat{\rho}_{wp}^2\) is the slope \(b_1\) of the regression of “partial on whole”: \(\hat{u}_p \sim b_0 +b_1 \hat{u}_w+\epsilon\). A note of caution about this statistic: its expected value requires the evaluation to be unbiased (\(\hat{b}_p=1\)), something that is not required for \(\hat{\rho}_{wp}\). In principle, the value obtained for \(\hat{\rho}_{wp}^2\) should be the square of the value obtained for \(\hat{\rho}_{wp}\), but this is not true in practice, as it holds only in expectation.

Both statistics are easy to compute:

rho_pw=cor(EBV2018,EBV2019) # 0.9101622
rho2_pw=cov(EBV2019,EBV2018)/var(EBV2019)# 1.15944

Note that in this example \(\hat{\rho}_{wp}^2\) is not admissible (\(\hat{\rho}_{wp}^2>1\) would mean that \(acc_p>acc_w\)), and this is because in the example \(\hat{b}_p\) is not even close to 1.

When animals are pre-selected (for instance, prospective AI rams selected based on parent average), their genetic variance \(\sigma^2_{u^*}\) is less than the “normal” genetic variance \(\sigma^2_{u}\). As an example, in Manech Tete Rousse, \(\sigma^2_{u} \approx 500\) but \(\sigma^2_{u^*} \approx 350\) for young selected rams (for milk yield) (F. L. Macedo, Christensen, and Legarra 2021). The variance \(\sigma^2_{u^*}\) can be estimated using Gibbs Sampling (Sorensen, Fernando, and Gianola (2001), F. L. Macedo et al. (2020)).

Using \(\sigma^2_{u^*}\) in the accuracy estimator gives the “selected” reliability (Bijma (2012), Dekkers (1992)), which is the “ability” to rank within those animals (more difficult when they are selected). However, we can’t (easily) use this reliability to predict genetic progress, and we can’t compare it with results in less selected animals, say, females. Also, the numbers do not match model-based ones, i.e. those from Selection Index theory or from the inverse of the MME. The solution to this was given by Dekkers (1992) and Bijma (2012), and it leads to the last statistic, \(\widehat{rel}_p\).

15.3.2 Examples of interpretation

Just to give a feeling of what these numbers look like and mean. When we did the first cross-validation approaches in dairy sheep, we used AI rams that, after selection based on parent average, were used in progeny testing. In order to assess whether genomic selection is good, we can evaluate these rams with ssGBLUP at birth, and then after progeny testing. The first result that we get is \(\hat{\rho}_{wp}\), but it can’t be used to predict the genetic progress of genomic selection. Then we do better and compute \(\widehat{acc}_p^2\), but we obtain a number that is very small because the animals are highly selected. What we want is the accuracy of the genomic young rams if they were not selected, because a genomic selection scheme genotypes a broad base of animals. To do so, we use the equation above to transform \(\widehat{acc}_p^2\) into \(\widehat{rel}_p\), which is the number that we want.

For instance, we obtained the following Table (F. L. Macedo et al. 2020):

Method        \(\widehat{acc}_p^2\)   \(\widehat{rel}_p\)   \(\hat{\rho}^2_{pw}\)
BLUP-MF       0.22                    0.53                  0.32
SSGBLUP-MF    0.32                    0.59                  0.45

In the Table, the numbers for \(\widehat{acc}_p^2\) seem “obviously wrong” because, for instance, in BLUP the reliability of the Parent Average from a progeny-tested sire and a phenotyped mother is usually close to 0.5, much higher than the observed numbers of \(\sim 0.25\). However, \(\widehat{acc}_p^2=0.22\) in BLUP is the reliability within the selected rams, whereas the reliability across all possible rams is in fact \(\widehat{rel}_p\), whose value of 0.53 is much closer to what we expect. The value of \(\hat{\rho}^2_{pw}\) is more complicated to interpret. However, all three columns make it obvious that SSGBLUP is more accurate than BLUP.
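As a sketch of this transformation: the prediction error variance (PEV) is the same whether it is referred to the selected group or to the unselected base, which gives \(\widehat{rel}_p = 1-(1-\widehat{acc}_p^2)\,\sigma^2_{u^*}/\sigma^2_u\) following the argument of Dekkers (1992). The snippet below applies this with the illustrative Manech Tete Rousse variances quoted earlier; it is a sketch of the principle, not the exact computation behind the Table (the function name and the variance values used are for illustration only).

```python
# Sketch: refer the reliability estimated within selected animals back to the
# unselected base, using the fact that the PEV is the same in both references
# (Dekkers 1992). Variance values are the illustrative ones from the text.

def rel_unselected(acc_sel_sq, var_u_sel, var_u):
    """rel = 1 - PEV / var_u, with PEV = (1 - acc*^2) * var_u_sel."""
    pev = (1.0 - acc_sel_sq) * var_u_sel
    return 1.0 - pev / var_u

# Manech Tete Rousse milk yield: sigma2_u ~ 500, sigma2_u* ~ 350
rel = rel_unselected(0.22, 350.0, 500.0)
print(round(rel, 3))  # larger than the "selected" 0.22
```

The key point is that shrinking the genetic variance by selection deflates the within-group reliability, while the PEV itself is unaffected.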

15.3.3 Practicalities

  1. You evaluate the bias and accuracies for a category of animals, which we call focal animals or the focal group. These are contemporary animals for which the properties above hold, which are “exchangeable” (in other words, we are interested in the group, not in each individual animal), and in which we are interested. For instance, newly born rams can be a focal group; first-lambing females can be a focal group; and rams with a first crop of daughters could be a focal group as well. But it is not a good idea to define a focal group composed of 50% progeny-tested rams that are 4 years old and 50% young animals that are 1 year old, because the former will be more accurate and the latter more shrunken towards the mean. The best way to define the focal animals is by analyzing the data: for instance, take all \(m\) rams born in, say, 2010, and from them select those \(n\) that had offspring with records in 2014, but not before. Then the number of animals in the focal group is \(n\).

  2. Define dates in a way such that the focal individuals have more information in the whole than in the partial data set. For instance, young rams could have only parents’ (and genomic) information in the partial data set and offspring information in the whole data set. First-lambing females could have one record for milk yield in the partial and two records in the whole data set; and similar cases. In the example above, the year of partial can be 2010 and the year of whole can be 2014.

  3. The way we do this is by using the data set and “looking forward” from each year. For instance, we take all rams born in 2014 that were used in AI, and a few years later (say 2017) we find out which of these rams have daughters with milk yield. This defines a focal group for “partial” = 2014 and “whole” = 2017. We can do the same for 2014 vs. 2018, 2019, etc.

  4. In this manner we have many “pairs” of whole and partial. For instance, you can do “partial” at 2010, 2011, … and compare each of them vs. “whole” at 2014, 2015, … It is important to do several comparisons because the statistics vary a lot across years. Using several pairs of whole and partial requires automatic handling of files and data editing, which we do using automated scripts (R and Unix tools). The genetic evaluations themselves can be run in any software you like.

  5. In practice we delete “records” (milk yield, etc.) based on the year, and we keep ALL pedigree and ALL markers. A more refined approach is to keep pedigree and markers only up to the same date: for instance, if “partial” = March 2014, we should keep records, pedigree, and markers only up to March 2014 (because pedigree and markers were used to predict the young rams).

  6. In genetic evaluations with Unknown Parent Groups, the EBVs are not estimable functions, so you need to refer all EBVs to a common genetic base in order to infer “bias” or not. Typically the genetic base is something like “average EBV of all females born in 2010”.

All this requires good knowledge of the data sets, the breeding scheme (or the breed), and a good command of scripting and genetic evaluation software.
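The bookkeeping in the points above can be sketched in a few lines. This is a toy example (all animal IDs, years, and field layouts are made up; real analyses use scripts over the actual pedigree and performance files):

```python
# Toy data: (ram, birth_year, years in which its offspring have records)
rams = [
    ("ram1", 2010, [2014, 2015]),
    ("ram2", 2010, [2013]),   # offspring recorded before 2014: excluded
    ("ram3", 2010, [2014]),
    ("ram4", 2011, [2015]),   # wrong birth year: excluded
]

# Point 1: focal group = rams born in 2010 whose first offspring records
# appear in 2014, not before
focal = [ram for ram, born, yrs in rams
         if born == 2010 and yrs and min(yrs) == 2014]

# Points 2-5: split phenotypic records by cut-off year; pedigree and markers
# are kept in full in both runs (the simple variant of point 5)
phenotypes = [("ewe1", 2009), ("ewe2", 2012), ("ewe3", 2016)]
year_partial, year_whole = 2010, 2014
partial = [rec for rec in phenotypes if rec[1] <= year_partial]
whole = [rec for rec in phenotypes if rec[1] <= year_whole]

print(focal)    # ['ram1', 'ram3']
print(partial)  # [('ewe1', 2009)]
print(whole)    # [('ewe1', 2009), ('ewe2', 2012)]
```

The two filtered record sets then feed the two genetic evaluations, run in whatever software is convenient.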

15.3.4 The importance of several comparisons

Figure 1 below shows all the estimates of \(b_{pw}\) in F. L. Macedo et al. (2020). On the X-axis we see the year of cut-off of partial, and the repeated points correspond to several whole years: 2010, 2011, … It is clear that there is large variation of \(b_{pw}\) due to chance, so to assess the unbiasedness of a genetic evaluation one should use several pairs of whole and partial and not rely on a single comparison. For instance, the year 2008 evaluation was clearly biased (\(b_{pw}<1\)), whereas the other years were not.

Figure 1: Different estimates of \(b_{pw}\).
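For a single pair of partial and whole evaluations, \(b_{pw}\) is the regression slope and \(\hat{\rho}_{pw}\) the correlation of whole EBVs on partial EBVs of the focal animals. A minimal sketch with made-up EBV values (not data from any real evaluation):

```python
import statistics as st

# Made-up EBVs of six focal animals in the partial and whole evaluations
ebv_partial = [1.2, 0.5, -0.3, 0.8, -1.1, 0.2]
ebv_whole = [1.0, 0.7, -0.5, 1.1, -1.3, 0.1]

n = len(ebv_partial)
mp, mw = st.mean(ebv_partial), st.mean(ebv_whole)
cov_pw = sum((p - mp) * (w - mw)
             for p, w in zip(ebv_partial, ebv_whole)) / (n - 1)

# Slope of whole on partial: 1 means no over/under-dispersion of the
# partial EBVs; the correlation enters the accuracy statistics
b_pw = cov_pw / st.variance(ebv_partial)
rho_pw = cov_pw / (st.stdev(ebv_partial) * st.stdev(ebv_whole))
```

In practice this computation is repeated over many (partial, whole) pairs, precisely because single estimates of \(b_{pw}\) fluctuate as the Figure shows.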

15.3.5 Estimation of genomic accuracies vs pedigree ones

How do we infer whether a genomic evaluation is more accurate than a pedigree-based one? There are two ways.

The first approach is to use whole and partial as we have explained so far, and run both BLUP and genomic prediction (e.g., SSGBLUP) on each pair, which yields a Table like the one above. This gives quite complete information, as we can compare accuracies across methods and at different times.

The second approach is to consider that the genomic evaluation has “more data”, so the pedigree-based evaluation is partial and the genomic evaluation is whole. The records \(\mathbf{y}\) are the same in both. Then the statistics above describe the ratio, increase, or absolute accuracies. For instance, if we observe \(\hat{\rho}_{pw}=1\), it means that adding genotypes did not change anything. However, if we obtain \(\hat{\rho}_{pw}=0.9\), it means that accuracy increased (relatively) by \(\frac{1}{\hat{\rho}_{pw}}-1=0.11\).
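This last computation can be sketched as follows (the function name is ours, for illustration only):

```python
def relative_accuracy_gain(rho_pw):
    # Under the LR interpretation, acc_whole / acc_partial = 1 / rho_pw,
    # so the relative gain in accuracy from adding genotypes is 1/rho_pw - 1
    return 1.0 / rho_pw - 1.0

print(round(relative_accuracy_gain(0.9), 2))  # 0.11, i.e. ~11% more accuracy
print(relative_accuracy_gain(1.0))            # 0.0: genotypes changed nothing
```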

15.4 References

16 Appendix

Legarra A., Christensen O. F., Aguilar I., Misztal I., 2014 Single Step, a general approach for genomic selection. Livestock Science 166: 54–65.

Legarra A., Reverter A., 2017 Can we frame and understand cross-validation results in animal breeding? In: AAABG conference, Townsville, Australia. http://agbu.une.edu.au/AAABG_2017.html

Varona L., Legarra A., Toro M. A., Vitezica Z. G., 2018 Non-additive Effects in Genomic Selection. Front. Genet. 9.

References

Aguilar, Ignacio, Andres Legarra, Fernando Cardoso, Yutaka Masuda, Daniela Lourenco, and Ignacy Misztal. 2019. “Frequentist p-Values for Large-Scale-Single Step Genome-Wide Association, with an Application to Birth Weight in American Angus Cattle.” Genetics Selection Evolution 51 (1): 28. https://doi.org/10.1186/s12711-019-0469-3.
Aguilar, Ignacio, Ignacy Misztal, Andres Legarra, and Shogo Tsuruta. 2011. “Efficient Computations of Genomic Relationship Matrix and Other Matrices Used in the Single-Step Evaluation.” Journal of Animal Breeding and Genetics 128: 422–28.
Aguilar, Ignacio, Shogo Tsuruta, Yutaka Masuda, Daniela Lourenco, Andrés Legarra, and Ignacy Misztal. 2018. “BLUPF90 Suite of Programs for Animal Breeding with Focus on Genomics.” In Proceedings of the World Congress on Genetics Applied to Livestock Production, Methods and Tools - Software:751. Auckland, New Zealand. http://www.wcgalp.org/proceedings/2018/blupf90-suite-programs-animal-breeding-focus-genomics.
Aguilar, I, and I Misztal. 2008. “Technical Note: Recursive Algorithm for Inbreeding Coefficients Assuming Nonzero Inbreeding of Unknown Parents.” Journal of Dairy Science 91 (4): 1669–72.
Aguilar, I, I Misztal, DL Johnson, A Legarra, S Tsuruta, and TJ Lawlor. 2010. “Hot Topic: A Unified Approach to Utilize Phenotypic, Full Pedigree, and Genomic Information for Genetic Evaluation of Holstein Final Score.” Journal of Dairy Science 93 (2): 743–52.
Aliloo, Hassan, Jennie E. Pryce, Oscar González-Recio, Benjamin G. Cocks, and Ben J. Hayes. 2016. “Accounting for Dominance to Improve Genomic Evaluations of Dairy Cows for Fertility and Milk Production Traits.” Genetics Selection Evolution 48: 8. https://doi.org/10.1186/s12711-016-0186-0.
Álvarez-Castro, José M., and Örjan Carlborg. 2007. “A Unified Model for Functional and Statistical Epistasis and Its Application in Quantitative Trait Loci Analysis.” Genetics 176 (2): 1151–67. https://doi.org/10.1534/genetics.106.067348.
Andersson, Leif. 2001. “Genetic Dissection of Phenotypic Diversity in Farm Animals.” Nature Reviews Genetics 2 (2): 130–38. https://doi.org/10.1038/35052563.
Bermann, Matias, Ignacio Aguilar, Daniela Lourenco, Ignacy Misztal, and Andres Legarra. 2023. “Reliabilities of Estimated Breeding Values in Models with Metafounders.” Genetics Selection Evolution 55 (1): 6. https://doi.org/10.1186/s12711-023-00778-2.
Bermann, Matias, Andres Legarra, Mary Kate Hollifield, Yutaka Masuda, Daniela Lourenco, and Ignacy Misztal. 2021. “Validation of Single-Step GBLUP Genomic Predictions from Threshold Models Using the Linear Regression Method: An Application in Chicken Mortality.” Journal of Animal Breeding and Genetics 138 (1): 4–13. https://doi.org/10.1111/jbg.12507.
Bijma, P. 2012. “Accuracies of Estimated Breeding Values from Ordinary Genetic Evaluations Do Not Reflect the Correlation Between True and Estimated Breeding Values in Selected Populations.” Journal of Animal Breeding and Genetics 129 (5): 345–58.
Caballero, A., and M. A. Toro. 2002. “Analysis of Genetic Diversity for the Management of Conserved Subdivided Populations.” Conservation Genetics 3: 289. https://doi.org/10.1023/A:1019956205473.
Campos, Gustavo de los, John M Hickey, Ricardo Pong-Wong, Hans D Daetwyler, and Mario PL Calus. 2013. “Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding.” Genetics 193 (2): 327–45.
Campos, Gustavo de los, Hugo Naya, Daniel Gianola, José Crossa, Andrés Legarra, Eduardo Manfredi, Kent Weigel, and José Miguel Cotes. 2009. “Predicting Quantitative Traits with Regression Models for Dense Molecular Markers and Pedigree.” Genetics 182 (1): 375–85. http://dx.doi.org/10.1534/genetics.109.101501.
Casella, George, and Roger L Berger. 1990. Statistical Inference. Vol. 70. Duxbury Press Belmont, CA.
Christensen, OF, P. Madsen, B. Nielsen, T. Ostersen, and G. Su. 2012. “Single-Step Methods for Genomic Evaluation in Pigs.” Animal 6: 1565–71.
Christensen, Ole F. 2012. “Compatibility of Pedigree-Based and Marker-Based Relationship Matrices for Single-Step Genetic Evaluation.” Genetics Selection Evolution 44 (1): 37.
Christensen, Ole F, and Mogens S Lund. 2010. “Genomic Prediction When Some Animals Are Not Genotyped.” Genet Sel Evol 42 (1): 2. http://dx.doi.org/10.1186/1297-9686-42-2.
Cochran, WG. 1951. “Improvement by Means of Selection.” In, 1:449–70.
Cockerham, C. C. 1969. “Variance of Gene Frequencies.” Evolution 23 (1): 72–84.
Cockerham, C. Clark. 1954. “An Extension of the Concept of Partitioning Hereditary Variance for Analysis of Covariances Among Relatives When Epistasis Is Present.” Genetics 39 (6): 859. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1209694/.
Cole, JB, PM VanRaden, JR O’Connell, CP Van Tassell, TS Sonstegard, RD Schnabel, JF Taylor, and GR Wiggans. 2009. “Distribution and Location of Genetic Effects for Dairy Traits.” Journal of Dairy Science 92 (6): 2931–46.
Colleau, J. J. 2002. “An Indirect Approach to the Extensive Calculation of Relationship Coefficients.” Genetics Selection Evolution 34 (4): 409–22.
Colleau, Jean-Jacques, Isabelle Palhière, Silvia T. Rodríguez-Ramilo, and Andres Legarra. 2017. “A Fast Indirect Method to Compute Functions of Genomic Relationships Concerning Genotyped and Ungenotyped Individuals, for Diversity Management.” Genetics Selection Evolution 49 (December): 87. https://doi.org/10.1186/s12711-017-0363-9.
Colombani, Carine, Andres Legarra, Sebastien Fritz, François Guillaume, Pascal Croiseau, Vincent Ducrocq, and Christèle Robert-Granié. 2013. “Application of Bayesian Least Absolute Shrinkage and Selection Operator (LASSO) and BayesCPi Methods for Genomic Selection in French Holstein and Montbéliarde Breeds.” Journal of Dairy Science 96 (1): 575–91.
Consortium, The Bovine Genome Sequencing and Analysis, Christine G. Elsik, Ross L. Tellam, and Kim C. Worley. 2009. “The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution.” Science 324 (5926): 522–28. https://doi.org/10.1126/science.1169588.
De Boer, IJM, and I Hoeschele. 1993. “Genetic Evaluation Methods for Populations with Dominance and Inbreeding.” Theoretical and Applied Genetics 86 (2-3): 245–58.
Dekkers, J. C. M. 1992. “Asymptotic Response to Selection on Best Linear Unbiased Predictors of Breeding Values.” Animal Science 54 (3): 351–60. https://doi.org/10.1017/S0003356100020808.
Dunner, Susana, M. Eugenia Miranda, Yves Amigues, Javier Cañón, Michel Georges, Roger Hanset, John Williams, and François Ménissier. 2003. “Haplotype Diversity of the Myostatin Gene Among Beef Cattlebreeds.” Genetics Selection Evolution 35 (January): 103. https://doi.org/10.1186/1297-9686-35-1-103.
Eding, H, and THE Meuwissen. 2001. “Marker‐based Estimates of Between and Within Population Kinships for the Conservation of Genetic Diversity.” Journal of Animal Breeding and Genetics 118 (3): 141–59.
Emigh, T. H. 1980. “A Comparison of Tests for Hardy-Weinberg Equilibrium.” Biometrics 36 (4): 627–42.
Ertl, Johann, Andrés Legarra, Zulma G. Vitezica, Luis Varona, Christian Edel, Reiner Emmerling, Kay-Uwe Götz, et al. 2014. “Genomic Analysis of Dominance Effects on Milk Production and Conformation Traits in Fleckvieh Cattle.” Genet Sel Evol 46: 40. http://www.biomedcentral.com/content/pdf/1297-9686-46-40.pdf.
Esfandyari, Hadi, Piter Bijma, Mark Henryon, Ole Fredslund Christensen, and Anders Christian Sørensen. 2016. “Genomic Prediction of Crossbred Performance Based on Purebred Landrace and Yorkshire Data Using a Dominance Model.” Genetics Selection Evolution 48 (1): 40. https://doi.org/10.1186/s12711-016-0220-2.
Falconer, D. S., and T. F. C. Mackay. 1996. Introduction to Quantitative Genetics. Longman New York.
Fernando, R. L., and M. Grossman. 1989. “Marker Assisted Prediction Using Best Linear Unbiased Prediction.” Genetics Selection Evolution 21: 467–77.
———. 1990. “Genetic Evaluation with Autosomal and X-Chromosomal Inheritance.” Theoretical and Applied Genetics 80 (1): 75–80. https://doi.org/10.1007/BF00224018.
Fernando, R. L., D. Habier, C. Stricker, J. C. M. Dekkers, and L. R. Totir. 2007. “Genomic Selection.” Acta Agriculturae Scandinavica, A 57 (4): 192–95.
Fernando, RL, and D Gianola. 1986. “Optimal Properties of the Conditional Mean as a Selection Criterion.” Theoretical and Applied Genetics 72 (6): 822–25.
Forneris, Natalia S, Andres Legarra, Zulma G Vitezica, Shogo Tsuruta, Ignacio Aguilar, Ignacy Misztal, and Rodolfo JC Cantet. 2015. “Quality Control of Genotypes Using Heritability Estimates of Gene Content at the Marker.” Genetics 199: 675–81.
Forni, Selma, Ignacio Aguilar, and Ignacy Misztal. 2011. “Different Genomic Relationship Matrices for Single-Step Analysis Using Phenotypic, Pedigree and Genomic Information.” Genetics Selection Evolution 43 (1): 1. http://www.gsejournal.org/content/43/1/1.
Fragomeni, Breno O., Daniela A. L. Lourenco, Yutaka Masuda, Andres Legarra, and Ignacy Misztal. 2017. “Incorporation of Causative Quantitative Trait Nucleotides in Single-Step GBLUP.” Genetics Selection Evolution 49 (July): 59. https://doi.org/10.1186/s12711-017-0335-0.
Garcia-Baccino, Carolina A., Andres Legarra, Ole F. Christensen, Ignacy Misztal, Ivan Pocrnic, Zulma G. Vitezica, and Rodolfo J. C. Cantet. 2017. “Metafounders Are Related to Fst Fixation Indices and Reduce Bias in Single-Step Genomic Evaluations.” Genetics Selection Evolution 49: 34. https://doi.org/10.1186/s12711-017-0309-2.
Garcia-Cortes, Luis Alberto, Andres Legarra, Claude Chevalet, and Miguel Angel Toro. 2013. “Variance and Covariance of Actual Relationships Between Relatives at One Locus.” PLoS ONE 8 (2): e57003.
Garrick, Dorian J, Jeremy F Taylor, and Rohan L Fernando. 2009. “Deregressing Estimated Breeding Values and Weighting Information for Genomic Regression Analyses.” Genet Sel Evol 41 (55): 44.
Gengler, Nicolas, S Abras, C Verkenne, SYLVIE Vanderick, M Szydlowski, and Robert Renaville. 2008. “Accuracy of Prediction of Gene Content in Large Animal Populations and Its Use for Candidate Gene Detection and Genetic Evaluation.” Journal of Dairy Science 91 (4): 1652–59.
Gengler, N., P. Mayeres, and M. Szydlowski. 2007. “A Simple Method to Approximate Gene Content in Large Pedigree Populations: Application to the Myostatin Gene in Dual-Purpose Belgian Blue Cattle.” Animal 1 (01): 21–28.
George, Edward I, and Robert E McCulloch. 1993. “Variable Selection via Gibbs Sampling.” Journal of the American Statistical Association 88 (423): 881–89.
Gianola, Daniel, Gustavo de los Campos, William G Hill, Eduardo Manfredi, and Rohan Fernando. 2009. “Additive Genetic Variability and the Bayesian Alphabet.” Genetics 183 (1): 347–63. http://dx.doi.org/10.1534/genetics.109.103952.
Gianola, Daniel, Rohan L Fernando, and Alessandra Stella. 2006. “Genomic-Assisted Prediction of Genetic Value with Semiparametric Procedures.” Genetics 173 (3): 1761–76. http://dx.doi.org/10.1534/genetics.105.049510.
Gianola, D., and R. L. Fernando. 1986. “Bayesian Methods in Animal Breeding Theory.” Journal of Animal Science 63 (1): 217.
Goffinet, B., and JM Elsen. 1984. “Critère Optimal de Sélection: Quelques Résultats Généraux.” Génétique Sélection Évolution 16 (3): 307–18.
Group, The International SNP Map Working. 2001. “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms.” Nature 409 (6822): 928–33. https://doi.org/10.1038/35057149.
Gualdrón Duarte, Jose L., Rodolfo JC Cantet, Ronald O. Bates, Catherine W. Ernst, Nancy E. Raney, and Juan P. Steibel. 2014. “Rapid Screening for Phenotype-Genotype Associations by Linear Transformations of Genomic Evaluations.” BMC Bioinformatics 15: 246. https://doi.org/10.1186/1471-2105-15-246.
Habier, David, Rohan L Fernando, Kadir Kizilkaya, and Dorian J Garrick. 2011. “Extension of the Bayesian Alphabet for Genomic Selection.” BMC Bioinformatics 12 (1): 186.
Habier, D., R. L. Fernando, and J. C M Dekkers. 2007. “The Impact of Genetic Relationship Information on Genome-Assisted Breeding Values.” Genetics 177 (4): 2389–97. http://dx.doi.org/10.1534/genetics.107.081190.
Harris, B. L., and D. L. Johnson. 2010. “Genomic Predictions for New Zealand Dairy Bulls and Integration with National Genetic Evaluation.” J Dairy Sci 93 (3): 1243–52. http://dx.doi.org/10.3168/jds.2009-2619.
Harville, D. 1976. “Extension of the Gauss-Markov Theorem to Include the Estimation of Random Effects.” The Annals of Statistics 4 (2): 384–95.
Hayes, B. J., P. M. Visscher, and M. E. Goddard. 2009. “Increased Accuracy of Artificial Selection by Using the Realized Relationship Matrix.” Genet Res 91 (1): 47–60. http://dx.doi.org/10.1017/S0016672308009981.
Hayes, BJ. 2011. “Technical Note: Efficient Parentage Assignment and Pedigree Reconstruction with Dense Single Nucleotide Polymorphism Data.” Journal of Dairy Science 94 (4): 2114–17.
Henderson, C. R. 1973. “Sire Evaluations and Genetic Trends.” J Anim Sci Symposium.
———. 1976. “A Simple Method for Computing the Inverse of a Numerator Relationship Matrix Used in Prediction of Breeding Values.” Biometrics 32 (1): 69–83.
———. 1982. “Best Linear Unbiased Prediction in Populations That Have Undergone Selection.” In Proceedings of the World Congress on Sheep and Beef Cattle Breeding, 1:191–201.
———. 1984. Applications of Linear Models in Animal Breeding. University of Guelph, Guelph.
———. 1985. “Best Linear Unbiased Prediction of Nonadditive Genetic Merits.” J. Anim. Sci 60: 111–17. http://xa.yimg.com/kq/groups/20928795/234496567/name/Henderson,+1985.pdf.
Henderson, Charles R. 1975. “Best Linear Unbiased Estimation and Prediction Under a Selection Model.” Biometrics, 423–47. http://www.jstor.org/stable/2529430.
Henderson, CR. 1978. “Undesirable Properties of Regressed Least Squares Prediction of Breeding Values.” Journal of Dairy Science 61 (1): 114–20.
Hickey, J. M., B. P. Kinghorn, B. Tier, J. F. Wilson, N. Dunstan, and J. H. J. van der Werf. 2011. “A Combined Long-Range Phasing and Long Haplotype Imputation Method to Impute Phase for SNP Genotypes.” Genet Sel Evol 43. https://doi.org/10.1186/1297-9686-43-12.
Hill, W. G., and B. S. Weir. 2011. “Variation in Actual Relationship as a Consequence of Mendelian Sampling and Linkage.” Genet Res (Camb), January, 1–18. http://dx.doi.org/10.1017/S0016672310000480.
Hill, W. G., and A. Mäki-Tanila. 2015. “Expected Influence of Linkage Disequilibrium on Genetic Variance Caused by Dominance and Epistasis on Quantitative Traits.” Journal of Animal Breeding and Genetics 132 (2): 176–86. https://doi.org/10.1111/jbg.12140.
Hill, WG, and Alan Robertson. 1968. “Linkage Disequilibrium in Finite Populations.” Theoretical and Applied Genetics 38 (6): 226–31.
Hill, William G., Michael E. Goddard, and Peter M. Visscher. 2008. “Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits.” PLOS Genetics 4 (2): e1000008. https://doi.org/10.1371/journal.pgen.1000008.
Hsu, Wan-Ling, Dorian J. Garrick, and Rohan L. Fernando. 2017. “The Accuracy and Bias of Single-Step Genomic Prediction for Populations Under Selection.” G3: Genes, Genomes, Genetics 7 (8): 2685–94. http://www.g3journal.org/content/7/8/2685.abstract.
Huang, Wen, and Trudy F. C. Mackay. 2016. “The Genetic Architecture of Quantitative Traits Cannot Be Inferred from Variance Component Analysis.” PLOS Genetics 12 (11): e1006421. https://doi.org/10.1371/journal.pgen.1006421.
Jensen, Just, Guosheng Su, and Per Madsen. 2012. “Partitioning Additive Genetic Variance into Genomic and Remaining Polygenic Components for Complex Traits in Dairy Cattle.” BMC Genetics 13 (1): 44.
Kachman, Stephen D., Matthew L. Spangler, Gary L. Bennett, Kathryn J. Hanford, Larry A. Kuehn, Warren M. Snelling, R. Mark Thallman, et al. 2013. “Comparison of Molecular Breeding Values Based on Within- and Across-Breed Training in Beef Cattle.” Genetics Selection Evolution 45 (August): 30. https://doi.org/10.1186/1297-9686-45-30.
Kass, Robert E, and Adrian E Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95.
Kennedy, BW, M Quinton, and JA Van Arendonk. 1992. “Estimation of Effects of Single Genes on Quantitative Traits.” Journal of Animal Science 70 (7): 2000–2012.
Lande, R., and R. Thompson. 1990. “Efficiency of Marker-Assisted Selection in the Improvement of Quantitative Traits.” Genetics 124 (3): 743–56.
Legarra, A., I. Aguilar, and I. Misztal. 2009. “A Relationship Matrix Including Full Pedigree and Genomic Information.” J Dairy Sci 92 (9): 4656–63. http://dx.doi.org/10.3168/jds.2009-2061.
Legarra, A, and I Misztal. 2008. “Technical Note: Computing Strategies in Genome-Wide Selection.” Journal of Dairy Science 91 (1): 360–66.
Legarra, Andres. 2016. “Comparing Estimates of Genetic Variance Across Different Relationship Models.” Theoretical Population Biology 107: 26–30. http://www.sciencedirect.com/science/article/pii/S0040580915000830.
Legarra, Andres, Ole F. Christensen, Ignacio Aguilar, and Ignacy Misztal. 2014. “Single Step, a General Approach for Genomic Selection.” Livestock Science 166: 54–65.
Legarra, Andres, Ole F. Christensen, Zulma G. Vitezica, Ignacio Aguilar, and Ignacy Misztal. 2015. “Ancestral Relationships Using Metafounders: Finite Ancestral Populations and Across Population Relationships.” Genetics 200 (2): 455–68. https://doi.org/10.1534/genetics.115.177014.
Legarra, Andres, Pascal Croiseau, Marie Pierre Sanchez, Simon Teyssèdre, Guillaume Sallé, Sophie Allais, Sébastien Fritz, Carole Rénée Moreno, Anne Ricard, and Jean-Michel Elsen. 2015. “A Comparison of Methods for Whole-Genome QTL Mapping Using Dense Markers in Four Livestock Species.” Genetics Selection Evolution 47: 6. https://doi.org/10.1186/s12711-015-0087-7.
Legarra, Andres, and Antonio Reverter. 2018. “Semi-Parametric Estimates of Population Accuracy and Bias of Predictions of Breeding Values and Future Phenotypes Using the LR Method.” Genetics Selection Evolution 50 (1): 53. [Erratum: Dec. 2019, v. 51 (1), p. 69].
Legarra, Andrés, Christèle Robert-Granié, Pascal Croiseau, François Guillaume, and Sébastien Fritz. 2011. “Improved Lasso for Genomic Selection.” Genet Res (Camb) 93 (1): 77–87. http://dx.doi.org/10.1017/S0016672310000534.
Legarra, Andrés, Christèle Robert-Granié, Eduardo Manfredi, and Jean-Michel Elsen. 2008. “Performance of Genomic Selection in Mice.” Genetics 180 (1): 611–18. http://dx.doi.org/10.1534/genetics.108.088575.
Legarra, Andrés, and Zulma G. Vitezica. 2015. “Genetic Evaluation with Major Genes and Polygenic Inheritance When Some Animals Are Not Genotyped Using Gene Content Multiple-Trait BLUP.” Genetics Selection Evolution 47: 89. https://doi.org/10.1186/s12711-015-0165-x.
Legarra, A, A Ricardi, and O Filangi. 2011. “GS3: Genomic Selection, Gibbs Sampling, Gauss-Seidel (and BayesCp).”
Lehermeier, C., G. de los Campos, V. Wimmer, and C.-C. Schön. 2017. “Genomic Variance Estimates: With or Without Disequilibrium Covariances?” Journal of Animal Breeding and Genetics 134 (3): 232–41. https://doi.org/10.1111/jbg.12268.
Li, C. C., and D. G. Horvitz. 1953. “Some Methods of Estimating the Inbreeding Coefficient.” Am J Hum Genet 5 (2): 107–17.
Lourenco, D. a. L., S. Tsuruta, B. O. Fragomeni, Y. Masuda, I. Aguilar, A. Legarra, J. K. Bertrand, et al. 2015. “Genetic Evaluation Using Single-Step Genomic Best Linear Unbiased Predictor in American Angus.” Journal of Animal Science 93 (6): 2653–62. https://doi.org/10.2527/jas.2014-8836.
Lourenco, DAL, I Misztal, S Tsuruta, I Aguilar, TJ Lawlor, S Forni, and JI Weller. 2014. “Are Evaluations on Young Genotyped Animals Benefiting from the Past Generations?” Journal of Dairy Science 97 (6): 3930–42.
Lourenco, DAL, I Misztal, H Wang, I Aguilar, S Tsuruta, and JK Bertrand. 2013. “Prediction Accuracy for a Simulated Maternally Affected Trait of Beef Cattle Using Different Genomic Evaluation Models.” Journal of Animal Science 91 (9): 4090–98.
Lourenco, Daniela AL, Breno O. Fragomeni, Shogo Tsuruta, Ignacio Aguilar, Birgit Zumbach, Rachel J. Hawken, Andres Legarra, and Ignacy Misztal. 2015. “Accuracy of Estimated Breeding Values with Genomic Information on Males, Females, or Both: An Example on Broiler Chicken.” Genetics Selection Evolution 47 (1): 56.
Luan, Tu, John Woolliams, Jorgen Odegard, Marlies Dolezal, Sergio Roman-Ponce, Alessandro Bagnato, and Theo Meuwissen. 2012. “The Importance of Identity-by-State Information for the Accuracy of Genomic Selection.” GENETICS SELECTION EVOLUTION 44 (1): 28. http://www.gsejournal.org/content/44/1/28.
Luo, ZW. 1998. “Detecting Linkage Disequilibrium Between a Polymorphic Marker Locus and a Trait Locus in Natural Populations.” Heredity 80 (2): 198–208.
Lynch, M. 1988. “Estimation of Relatedness by DNA Fingerprinting.” Mol Biol Evol 5 (5): 584–99.
Lynch, M., and B. Walsh. 1998. Genetics and Analysis of Quantitative Traits. Sinauer associates.
Macedo, Fernando L., Ole F. Christensen, Jean-Michel Astruc, Ignacio Aguilar, Yutaka Masuda, and Andrés Legarra. 2020. “Bias and Accuracy of Dairy Sheep Evaluations Using BLUP and SSGBLUP with Metafounders and Unknown Parent Groups.” Genetics Selection Evolution 52 (August). https://doi.org/10.1186/s12711-020-00567-1.
Macedo, Fernando L., Ole F. Christensen, and Andrés Legarra. 2021. “Selection and Drift Reduce Genetic Variation for Milk Yield in Manech Tête Rousse Dairy Sheep.” JDS Communications 2 (1): 31–34. https://doi.org/10.3168/jdsc.2020-0010.
Macedo, FL, A Reverter, and Andrés Legarra. 2020. “Behavior of the Linear Regression Method to Estimate Bias and Accuracies with Correct and Incorrect Genetic Evaluation Models.” Journal of Dairy Science 103 (1): 529–44.
Marchal, Alexandre, Andrés Legarra, Sébastien Tisné, Catherine Carasco-Lacombe, Aurore Manez, Edyana Suryana, Alphonse Omoré, Bruno Nouy, Tristan Durand-Gasselin, and Leopoldo Sánchez. 2016. “Multivariate Genomic Model Improves Analysis of Oil Palm (Elaeis Guineensis Jacq.) Progeny Tests.” Molecular Breeding 36 (1): 2.
Martini, Johannes W. R., Valentin Wimmer, Malena Erbe, and Henner Simianer. 2016. “Epistasis and Covariance: How Gene Interaction Translates into Genomic Relationship.” TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik 129 (5): 963–76. https://doi.org/10.1007/s00122-016-2675-5.
Matukumalli, Lakshmi K., Cynthia T. Lawley, Robert D. Schnabel, Jeremy F. Taylor, Mark F. Allan, Michael P. Heaton, Jeff O’Connell, et al. 2009. “Development and Characterization of a High Density SNP Genotyping Assay for Cattle.” PloS One 4 (4): e5350. https://doi.org/10.1371/journal.pone.0005350.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. 2001. “Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps.” Genetics 157 (4): 1819–29.
Meuwissen, Theo, and Mike Goddard. 2010. “The Use of Family Relationships and Linkage Disequilibrium to Impute Phase and Missing Genotypes in up to Whole-Genome Sequence Density Genotypic Data.” Genetics 185 (4): 1441–49.
Meuwissen, Theo, Ben Hayes, and Mike Goddard. 2016. “Genomic Selection: A Paradigm Shift in Animal Breeding.” Animal Frontiers 6 (1): 6–14. https://doi.org/10.2527/af.2016-0002.
Misztal, Ignacy. 2007. “Shortage of Quantitative Geneticists in Animal Breeding.” Journal of Animal Breeding and Genetics 124 (5): 255–56. https://doi.org/10.1111/j.1439-0388.2007.00679.x.
Misztal, I., A. Legarra, and I. Aguilar. 2009. “Computing Procedures for Genetic Evaluation Including Phenotypic, Full Pedigree, and Genomic Information.” J Dairy Sci 92 (9): 4648–55. http://dx.doi.org/10.3168/jds.2009-2064.
Misztal, I, Zulma-Gladis Vitezica, Andres Legarra, I Aguilar, and AA Swan. 2013. “Unknown‐parent Groups in Single‐step Genomic Evaluation.” Journal of Animal Breeding and Genetics 130 (4): 252–58.
Moghaddar, N., and J. H. J. van der Werf. 2017. “Genomic Estimation of Additive and Dominance Effects and Impact of Accounting for Dominance on Accuracy of Genomic Evaluation in Sheep Populations.” Journal of Animal Breeding and Genetics 134 (6): 453–62. https://doi.org/10.1111/jbg.12287.
Mrode, RA, and R. Thompson. 2005. Linear Models for the Prediction of Animal Breeding Values. Cabi.
Muñoz, Patricio R., Marcio F. R. Resende, Salvador A. Gezan, Marcos Deon Vilela Resende, Gustavo de los Campos, Matias Kirst, Dudley Huber, and Gary F. Peter. 2014. “Unraveling Additive from Nonadditive Effects Using Genomic Relationship Matrices.” Genetics 198 (4): 1759–68. https://doi.org/10.1534/genetics.114.171322.
Nejati-Javaremi, A., C. Smith, and J. P. Gibson. 1997. “Effect of Total Allelic Relationship on Accuracy of Evaluation and Response to Selection.” J Anim Sci 75 (7): 1738–45.
Page, B. T., E. Casas, M. P. Heaton, N. G. Cullen, D. L. Hyndman, C. A. Morris, A. M. Crawford, et al. 2002. “Evaluation of Single-Nucleotide Polymorphisms in CAPN1 for Association with Meat Tenderness in Cattle.” Journal of Animal Science 80 (12): 3077–85. https://doi.org/10.2527/2002.80123077x.
Park, T., and G. Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Pérez, Paulino, Gustavo de Los Campos, José Crossa, and Daniel Gianola. 2010. “Genomic-Enabled Prediction Based on Molecular Markers and Pedigree Using the Bayesian Linear Regression Package in R.” The Plant Genome 3 (2): 106–16.
Powell, Joseph E, Peter M Visscher, and Michael E Goddard. 2010. “Reconciling the Analysis of IBD and IBS in Complex Trait Studies.” Nat Rev Genet 11 (September): 800–805. http://dx.doi.org/10.1038/nrg2865.
Quaas, R. L. 1976. “Computing the Diagonal Elements and Inverse of a Large Numerator Relationship Matrix.” Biometrics 32 (4): 949–53.
———. 1988. “Additive Genetic Model with Groups and Relationships.” Journal of Dairy Science 71: 1338–45.
Reverter, A., B. L. Golden, R. M. Bourdon, and J. S. Brinks. 1994. “Technical Note: Detection of Bias in Genetic Predictions.” Journal of Animal Science 72 (1): 34–37. https://doi.org/10.2527/1994.72134x.
Ricard, A, S Danvy, and A Legarra. 2013. “Computation of Deregressed Proofs for Genomic Selection When Own Phenotypes Exist with an Application in French Show-Jumping Horses.” Journal of Animal Science 91 (3): 1076–85.
Ritland, K. 1996. “Estimators for Pairwise Relatedness and Individual Inbreeding Coefficients.” Genetical Research 67 (2): 175–85.
Rodríguez-Ramilo, Silvia Teresa, Luis Alberto García-Cortés, and Óscar González-Recio. 2014. “Combining Genomic and Genealogical Information in a Reproducing Kernel Hilbert Spaces Regression Model for Genome-Enabled Predictions in Dairy Cattle.” PLoS ONE 9 (3): e93424.
Rogers, Alan R., and Chad Huff. 2009. “Linkage Disequilibrium Between Loci With Unknown Phase.” Genetics 182 (3): 839–44. https://doi.org/10.1534/genetics.108.093153.
Saatchi, Mahdi, Mathew C. McClure, Stephanie D. McKay, Megan M. Rolf, JaeWoo Kim, Jared E. Decker, Tasia M. Taxis, et al. 2011. “Accuracies of Genomic Breeding Values in American Angus Beef Cattle Using K-Means Clustering for Cross-Validation.” Genetics Selection Evolution 43: 40. https://doi.org/10.1186/1297-9686-43-40.
Schork, Nicholas J., Daniele Fallin, and Jerry S. Lanchbury. 2000. “Single Nucleotide Polymorphisms and the Future of Genetic Epidemiology.” Clinical Genetics 58 (4): 250–64. https://doi.org/10.1034/j.1399-0004.2000.580402.x.
Searle, S. R. 1982. Matrix Algebra Useful for Statistics. John Wiley.
Shen, Xia, Moudud Alam, Freddy Fikse, and Lars Rönnegård. 2013. “A Novel Generalized Ridge Regression Method for Quantitative Genetics.” Genetics 193 (4): 1255–68.
Sillanpää, M. J. 2011. “On Statistical Methods for Estimating Heritability in Wild Populations.” Molecular Ecology 20 (7): 1324–32.
Smith, J A, A M Lewis, P Wiener, and J L Williams. 2000. “Genetic Variation in the Bovine Myostatin Gene in UK Beef Cattle: Allele Frequencies and Haplotype Analysis in the South Devon.” Animal Genetics 31 (5): 306–9. https://doi.org/10.1046/j.1365-2052.2000.00521.x.
Snelling, W. M., M. F. Allan, J. W. Keele, L. A. Kuehn, R. M. Thallman, G. L. Bennett, C. L. Ferrell, et al. 2011. “Partial-Genome Evaluation of Postweaning Feed Intake and Efficiency of Crossbred Beef Cattle.” Journal of Animal Science 89 (6): 1731–41. https://doi.org/10.2527/jas.2010-3526.
Soller, M., and J. S. Beckmann. 1983. “Genetic Polymorphism in Varietal Identification and Genetic Improvement.” Theoretical and Applied Genetics 67 (1): 25–33. https://doi.org/10.1007/BF00303917.
Sorensen, Daniel, R. Fernando, and D. Gianola. 2001. “Inferring the Trajectory of Genetic Variance in the Course of Artificial Selection.” Genetical Research 77 (1): 83–94. http://journals.cambridge.org/abstract_S0016672300004845.
Sorensen, Daniel, and Daniel Gianola. 2002. Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. Springer.
Stoneking, Mark. 2001. “Single Nucleotide Polymorphisms: From the Evolutionary Past...” Nature 409 (6822): 821–22. https://doi.org/10.1038/35057279.
Strandén, I., and D. J. Garrick. 2009. “Technical Note: Derivation of Equivalent Computing Algorithms for Genomic Predictions and Reliabilities of Animal Merit.” J Dairy Sci 92 (6): 2971–75. http://dx.doi.org/10.3168/jds.2008-1929.
Strandén, I., K. Matilainen, G. P. Aamand, and E. A. Mäntysaari. 2017. “Solving Efficiently Large Single-Step Genomic Best Linear Unbiased Prediction Models.” Journal of Animal Breeding and Genetics 134 (3): 264–74. https://doi.org/10.1111/jbg.12257.
Strandén, Ismo, and Ole F Christensen. 2011. “Allele Coding in Genomic Evaluation.” Genet Sel Evol 43: 25. http://dx.doi.org/10.1186/1297-9686-43-25.
Su, Guosheng, Ole F Christensen, Tage Ostersen, Mark Henryon, and Mogens S Lund. 2012. “Estimating Additive and Non-Additive Genetic Variances and Predicting Genetic Merits Using Genome-Wide Dense Single Nucleotide Polymorphism Markers.” PLoS ONE 7 (9): e45293.
Su, Guosheng, Bernt Guldbrandtsen, Gert P Aamand, Ismo Strandén, and Mogens S Lund. 2014. “Genomic Relationships Based on X Chromosome Markers and Accuracy of Genomic Predictions with and Without X Chromosome Markers.” Genetics Selection Evolution 46 (1): 47. https://doi.org/10.1186/1297-9686-46-47.
Su, Guosheng, Per Madsen, Ulrik Sander Nielsen, Esa A Mäntysaari, Gert P Aamand, Ole Fredslund Christensen, and Mogens Sandø Lund. 2012. “Genomic Prediction for Nordic Red Cattle Using One-Step and Selection Index Blending.” Journal of Dairy Science 95 (2): 909–17.
Sun, C., P. M. VanRaden, J. R. O’Connell, K. A. Weigel, and D. Gianola. 2013. “Mating Programs Including Genomic Relationships and Dominance Effects.” Journal of Dairy Science 96 (12): 8014–23. https://doi.org/10.3168/jds.2013-6969.
Sun, Xiaochen, Long Qu, Dorian J Garrick, Jack CM Dekkers, and Rohan L Fernando. 2012. “A Fast EM Algorithm for BayesA-Like Prediction of Genomic Breeding Values.” PLoS ONE 7 (11): e49157.
Sved, J. A. 1971. “Linkage Disequilibrium and Homozygosity of Chromosome Segments in Finite Populations.” Theoretical Population Biology 2 (2): 125–41.
Tenesa, Albert, Pau Navarro, Ben J Hayes, David L Duffy, Geraldine M Clarke, Mike E Goddard, and Peter M Visscher. 2007. “Recent Human Effective Population Size Estimated from Linkage Disequilibrium.” Genome Research 17 (4): 520–26.
Thompson, Elizabeth A. 2013. “Identity by Descent: Variation in Meiosis, Across Genomes, and in Populations.” Genetics 194 (2): 301–26. https://doi.org/10.1534/genetics.112.148825.
Thompson, R. 1979. “Sire Evaluation.” Biometrics 35: 339–53.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–88.
Tier, Bruce, Karin Meyer, and Andrew Swan. 2018. “On Implied Genetic Effects, Relationships and Alternate Allele Coding.” In Proceedings of the World Congress on Genetics Applied to Livestock Production, Methods and Tools - Models and Computing Strategies 2:127. Auckland, New Zealand.
Toro, Miguel Ángel, Luis Alberto García-Cortés, and Andrés Legarra. 2011. “A Note on the Rationale for Estimating Genealogical Coancestry from Molecular Markers.” Genet Sel Evol 43 (1): 27.
Toro, Miguel A, and Luis Varona. 2010. “A Note on Mate Allocation for Dominance Handling in Genomic Selection.” Genet Sel Evol 42: 33.
VanRaden, P. M. 2007. “Genomic Measures of Relationship and Inbreeding.” Interbull Bull 37: 33–36.
———. 2008. “Efficient Methods to Compute Genomic Predictions.” J. Dairy Sci. 91 (11): 4414–23. http://jds.fass.org/cgi/content/abstract/91/11/4414.
VanRaden, P. M., C. P. Van Tassell, G. R. Wiggans, T. S. Sonstegard, R. D. Schnabel, J. F. Taylor, and F. S. Schenkel. 2009. “Invited Review: Reliability of Genomic Predictions for North American Holstein Bulls.” J Dairy Sci 92 (1): 16–24. http://dx.doi.org/10.3168/jds.2008-1514.
VanRaden, Paul M, Jeffrey R O’Connell, George R Wiggans, and Kent A Weigel. 2011. “Genomic Evaluations with Many More Genotypes.” Genetics Selection Evolution 43 (1): 10. https://doi.org/10.1186/1297-9686-43-10.
VanRaden, P. M., and G. R. Wiggans. 1991. “Derivation, Calculation, and Use of National Animal Model Information.” Journal of Dairy Science 74 (8): 2737–46.
Varona, L. 2014. “A General Approach for Calculation of Genomic Relationship Matrices for Epistatic Effects.” In Proceedings of the World Congress on Genetics Applied to Livestock Production, 11:11–22. Vancouver, Canada. https://www.asas.org/docs/default-source/wcgalp-posters/499_paper_9538_manuscript_712_0.pdf?sfvrsn=2.
Varona, L., L. A. García-Cortés, and M. Pérez-Enciso. 2001. “Bayes Factors for Detection of Quantitative Trait Loci.” Genet Sel Evol 33 (2): 133–52. http://dx.doi.org/10.1051/gse:2001113.
Varona, Luis. 2010. “Understanding the Use of Bayes Factor for Testing Candidate Genes.” Journal of Animal Breeding and Genetics 127 (1): 16–25.
Varona, Luis, Andres Legarra, Miguel A. Toro, and Zulma G. Vitezica. 2018. “Non-Additive Effects in Genomic Selection.” Frontiers in Genetics 9. https://doi.org/10.3389/fgene.2018.00078.
Verbyla, Klara L, Ben J Hayes, Philip J Bowman, and Michael E Goddard. 2009. “Accuracy of Genomic Selection Using Stochastic Search Variable Selection in Australian Holstein Friesian Dairy Cattle.” Genet Res 91 (5): 307–11. http://dx.doi.org/10.1017/S0016672309990243.
Vidal, O., J. L. Noguera, M. Amills, L. Varona, M. Gil, N. Jiménez, G. Davalos, J. M. Folch, and A. Sanchez. 2005. “Identification of Carcass and Meat Quality Quantitative Trait Loci in a Landrace Pig Population Selected for Growth and Leanness.” Journal of Animal Science 83 (2): 293–300.
Villanueva, B., J. Fernández, L. A. García-Cortés, L. Varona, H. D. Daetwyler, and M. A. Toro. 2011. “Accuracy of Genome-Wide Evaluation for Disease Resistance in Aquaculture Breeding Programs.” Journal of Animal Science 89 (11): 3433–42.
Vitezica, Z. G., A. Legarra, M. A. Toro, and L. Varona. 2017. “Orthogonal Estimates of Variances for Additive, Dominance and Epistatic Effects in Populations.” Genetics, January, genetics.116.199406. https://doi.org/10.1534/genetics.116.199406.
Vitezica, Z. G., Ignacio Aguilar, Ignacy Misztal, and Andres Legarra. 2011. “Bias in Genomic Predictions for Populations Under Selection.” Genetics Research 93 (5): 357–66.
Vitezica, Zulma G, Luis Varona, and Andres Legarra. 2013. “On the Additive and Dominant Variance and Covariance of Individuals Within the Genomic Selection Scope.” Genetics 195 (4): 1223–30.
Wakefield, Jon. 2009. “Bayes Factors for Genome-Wide Association Studies: Comparison with P-Values.” Genetic Epidemiology 33 (1): 79–86.
Wang, H., I. Misztal, I. Aguilar, A. Legarra, and W. M. Muir. 2012. “Genome-Wide Association Mapping Including Phenotypes from Relatives Without Genotypes.” Genetics Research 94 (2): 73–83.
Wiggans, G. R., T. S. Sonstegard, P. M. VanRaden, L. K. Matukumalli, R. D. Schnabel, J. F. Taylor, F. S. Schenkel, and C. P. Van Tassell. 2009. “Selection of Single-Nucleotide Polymorphisms and Quality of Genotypes Used in Genomic Evaluation of Dairy Cattle in the United States and Canada.” J Dairy Sci 92 (7): 3431–36. http://dx.doi.org/10.3168/jds.2008-1758.
Wiggans, G. R., P. M. VanRaden, L. R. Bacheller, M. E. Tooker, J. L. Hutchison, T. A. Cooper, and T. S. Sonstegard. 2010. “Selection and Management of DNA Markers for Use in Genomic Evaluation.” Journal of Dairy Science 93 (5): 2287–92. https://doi.org/10.3168/jds.2009-2773.
Wright, Sewall. 1922. “Coefficients of Inbreeding and Relationship.” Am. Nat. 56: 330–38.
Xiang, Tao, Ole Fredslund Christensen, and Andres Legarra. 2017. “Technical Note: Genomic Evaluation for Crossbred Performance in a Single-Step Approach with Metafounders.” Journal of Animal Science 95 (4): 1472–80.
Xiang, Tao, Ole Fredslund Christensen, Zulma Gladis Vitezica, and Andres Legarra. 2016. “Genomic Evaluation by Including Dominance Effects and Inbreeding Depression for Purebred and Crossbred Performance with an Application in Pigs.” Genetics Selection Evolution 48: 92. https://doi.org/10.1186/s12711-016-0271-4.
Xu, Shizhong. 2013. “Mapping Quantitative Trait Loci by Controlling Polygenic Background Effects.” Genetics 195 (4): 1209–22. https://doi.org/10.1534/genetics.113.157032.
Xu, Shizhong, Dan Zhu, and Qifa Zhang. 2014. “Predicting Hybrid Performance in Rice Using Genomic Best Linear Unbiased Prediction.” Proceedings of the National Academy of Sciences 111 (34): 12456–61. https://doi.org/10.1073/pnas.1413750111.
Yang, Jian, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, et al. 2010. “Common SNPs Explain a Large Proportion of the Heritability for Human Height.” Nat Genet 42 (7): 565–69. http://dx.doi.org/10.1038/ng.608.
Zhang, Zhe, Jianfeng Liu, Xiangdong Ding, Piter Bijma, Dirk-Jan de Koning, and Qin Zhang. 2010. “Best Linear Unbiased Prediction of Genomic Breeding Values Using a Trait-Specific Marker-Derived Relationship Matrix.” PLoS ONE 5 (9): e12648.