<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Bayesian | Ag Prophet</title>
    <link>https://agprophet.netlify.app/tag/bayesian/</link>
      <atom:link href="https://agprophet.netlify.app/tag/bayesian/index.xml" rel="self" type="application/rss+xml" />
    <description>Bayesian</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 03 Jun 2026 09:00:00 -0500</lastBuildDate>
    <image>
      <url>https://agprophet.netlify.app/media/icon_hu4af01c9d488c7bef25f821204732910a_1117_512x512_fill_lanczos_center_3.png</url>
      <title>Bayesian</title>
      <link>https://agprophet.netlify.app/tag/bayesian/</link>
    </image>
    
    <item>
      <title>How parameters are estimated</title>
      <link>https://agprophet.netlify.app/project/estimation/index_en/</link>
      <pubDate>Wed, 03 Jun 2026 09:00:00 -0500</pubDate>
      <guid>https://agprophet.netlify.app/project/estimation/index_en/</guid>
      <description>


&lt;p&gt;There were two courses I struggled with during my statistics minor at North
Carolina State University: ST 501 and ST 502. The second is the basis for this
tutorial.&lt;/p&gt;
&lt;p&gt;Most applied statistics courses teach you how to &lt;em&gt;call&lt;/em&gt; an estimation routine:
&lt;code&gt;lm()&lt;/code&gt;, &lt;code&gt;glm()&lt;/code&gt;, &lt;code&gt;lmer()&lt;/code&gt;, &lt;code&gt;brms::brm()&lt;/code&gt;. Press the button, read the output,
report the standard errors.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;math.jpg&#34; alt=&#34;&#34; width=&#34;70%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This tutorial is about the layer underneath. It will answer a single question from four different philosophical directions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Given data, and a model with unknown parameters, &lt;strong&gt;how is a number actually
chosen&lt;/strong&gt; to stand in for each parameter, and &lt;strong&gt;how much should we trust it?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We will not lean on black boxes. We will write likelihood functions by hand,
maximise them (by minimising their negatives) with &lt;code&gt;optim()&lt;/code&gt;, build a restricted likelihood from first
principles, and code a Gibbs sampler from its full conditionals. Every method
is applied to the &lt;em&gt;same&lt;/em&gt; simulated dataset whose true parameters we know, so we
can always check the answer against the truth.&lt;/p&gt;
&lt;p&gt;The four frameworks, in the order we meet them:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;50%&#34; /&gt;
&lt;col width=&#34;50%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;One-line philosophy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;strong&gt;Method of Moments (MoM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match theoretical moments to sample moments and solve.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;strong&gt;Maximum Likelihood (MLE)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Choose the parameters that make the observed data most probable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;strong&gt;Restricted ML (REML)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximise the likelihood of &lt;em&gt;error contrasts&lt;/em&gt; to debias variance components.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;strong&gt;Bayesian&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Treat parameters as random; update a prior into a posterior with Bayes’ rule.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;By the end you should be able to look at any of these outputs and know exactly
what objective function was optimised, what assumptions were made, and what the
reported uncertainty actually means.&lt;/p&gt;
&lt;div id=&#34;the-simulated-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The simulated data&lt;/h1&gt;
&lt;p&gt;Everything below lives inside one small, transparent data-generating process. We use a &lt;strong&gt;one-way random-effects model&lt;/strong&gt; (a balanced one-factor mixed
model). It is the simplest model rich enough to expose &lt;em&gt;all four&lt;/em&gt; methods at
their most interesting: it has a fixed effect (a grand mean) &lt;strong&gt;and&lt;/strong&gt; two
variance components, which is precisely the setting where REML is most useful.&lt;/p&gt;
&lt;div id=&#34;the-data-generating-process&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The data-generating process&lt;/h2&gt;
&lt;p&gt;Let there be &lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt; groups indexed by &lt;span class=&#34;math inline&#34;&gt;\(i = 1,\dots,g\)&lt;/span&gt;, with &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; observations each,
indexed by &lt;span class=&#34;math inline&#34;&gt;\(j = 1,\dots,n\)&lt;/span&gt;, for a total of &lt;span class=&#34;math inline&#34;&gt;\(N = gn\)&lt;/span&gt; observations. The model is&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
y_{ij} \;=\; \mu \;+\; a_i \;+\; \varepsilon_{ij},
\qquad
a_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma_a^2),
\qquad
\varepsilon_{ij} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,\sigma_e^2),
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;with the &lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\varepsilon_{ij}\)&lt;/span&gt; mutually independent. The three unknown
parameters are&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;, the &lt;strong&gt;grand mean&lt;/strong&gt; (a &lt;em&gt;fixed effect&lt;/em&gt;),&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;, the &lt;strong&gt;between-group&lt;/strong&gt; variance (a &lt;em&gt;variance component&lt;/em&gt;),&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;, the &lt;strong&gt;within-group / residual&lt;/strong&gt; variance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A useful derived quantity is the &lt;strong&gt;intraclass correlation (ICC)&lt;/strong&gt;,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\rho \;=\; \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2},
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;the correlation between two observations in the same group, and the fraction of
total variance attributable to group membership.&lt;/p&gt;
&lt;p&gt;We fix the truth and simulate:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g  &amp;lt;- 10          # number of groups
n  &amp;lt;- 8           # observations per group (balanced)
N  &amp;lt;- g * n       # total observations

mu_true  &amp;lt;- 10    # grand mean       (fixed effect)
sa2_true &amp;lt;- 4     # between-group variance  sigma_a^2
se2_true &amp;lt;- 9     # within-group variance   sigma_e^2

a_true &amp;lt;- rnorm(g, mean = 0, sd = sqrt(sa2_true))   # the realised group effects
group  &amp;lt;- factor(rep(seq_len(g), each = n))
y      &amp;lt;- mu_true + a_true[as.integer(group)] +
          rnorm(N, mean = 0, sd = sqrt(se2_true))

dat &amp;lt;- data.frame(group = group, y = y)
icc_true &amp;lt;- sa2_true / (sa2_true + se2_true)

c(mu = mu_true, sigma_a2 = sa2_true, sigma_e2 = se2_true, ICC = round(icc_true, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       mu sigma_a2 sigma_e2      ICC 
##   10.000    4.000    9.000    0.308&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So the &lt;em&gt;true&lt;/em&gt; ICC is 0.308: roughly 31% of the variability
comes from differences between groups, and the remaining 69% from noise within groups.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Looking at the data&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-data&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-data-1.png&#34; alt=&#34;Simulated data. Each light point is an observation; coloured diamonds are group means; the dashed line is the grand mean. The spread of group means reflects $\sigma_a^2$; the scatter within each column reflects $\sigma_e^2$.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Simulated data. Each light point is an observation; coloured diamonds are group means; the dashed line is the grand mean. The spread of group means reflects &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;; the scatter within each column reflects &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The entire tutorial is an attempt to recover the three numbers
(&lt;span class=&#34;math inline&#34;&gt;\(\mu, \sigma_a^2, \sigma_e^2\)&lt;/span&gt;) from this picture, using four different logics.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-common-notational-backbone&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A common notational backbone&lt;/h2&gt;
&lt;div class=&#34;callout-note&#34; title=&#34;If you would rather skip the linear algebra&#34;&gt;
&lt;p&gt;The next few sections write the model with matrices, because that is the language MLE and REML are built in. If matrices are not your thing, read the shaded “in plain terms” lines and the figures, and skim past the symbols. Every method also has a simple closed-form answer for our balanced design, and we always show it.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For the methods considered here, it is convenient to express the model in matrix notation. Stacking all observations into the vector &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{y}\in\mathbb{R}^N\)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{y} \;=\; \mathbf{X}\boldsymbol\beta + \mathbf{Z}\mathbf{a} + \boldsymbol\varepsilon,
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{X} = \mathbf{1}_N\)&lt;/span&gt; is the &lt;span class=&#34;math inline&#34;&gt;\(N\times 1\)&lt;/span&gt; design matrix for the single
fixed effect &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\beta = \mu\)&lt;/span&gt; (so &lt;span class=&#34;math inline&#34;&gt;\(p = 1\)&lt;/span&gt;), &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{Z}\)&lt;/span&gt; is the
&lt;span class=&#34;math inline&#34;&gt;\(N\times g\)&lt;/span&gt; group-incidence matrix, &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{a}\sim\mathcal N(\mathbf 0, \sigma_a^2\mathbf I_g)\)&lt;/span&gt;,
and &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\varepsilon\sim\mathcal N(\mathbf 0,\sigma_e^2\mathbf I_N)\)&lt;/span&gt;.
Because both the random effects and residual errors are normally distributed, the response vector &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{y}\)&lt;/span&gt; follows a multivariate normal distribution,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{y} \;\sim\; \mathcal{N}\!\big(\mathbf{X}\boldsymbol\beta,\; \mathbf{V}\big)
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;with covariance matrix&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{V} \;=\; \sigma_a^2\,\mathbf{Z}\mathbf{Z}^\top + \sigma_e^2\,\mathbf{I}_N .
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The matrix &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V}\)&lt;/span&gt; describes the variance and covariance among all observations. For the balanced one-way random-effects model, &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V}\)&lt;/span&gt; is &lt;em&gt;block-diagonal&lt;/em&gt; because observations from different groups do not share a random effect and are therefore uncorrelated. Within each group of size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the covariance has the &lt;em&gt;compound-symmetry&lt;/em&gt; form &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\mathbf{I}_n + \sigma_a^2\mathbf{J}_n\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{J}_n\)&lt;/span&gt; is the &lt;span class=&#34;math inline&#34;&gt;\(n \times n\)&lt;/span&gt; matrix of ones.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{V}_{\text{group}}
\;=\;
\sigma_e^2\,\mathbf{I}_n + \sigma_a^2\,\mathbf{J}_n
\;=\;
\begin{pmatrix}
\sigma_a^2+\sigma_e^2 &amp;amp; \sigma_a^2 &amp;amp; \cdots &amp;amp; \sigma_a^2 \\
\sigma_a^2 &amp;amp; \sigma_a^2+\sigma_e^2 &amp;amp; \cdots &amp;amp; \sigma_a^2 \\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \\
\sigma_a^2 &amp;amp; \sigma_a^2 &amp;amp; \cdots &amp;amp; \sigma_a^2+\sigma_e^2
\end{pmatrix}.
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The full covariance matrix is obtained by placing &lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt; identical blocks along the diagonal (&lt;span class=&#34;math inline&#34;&gt;\(\mathbf{I}_g \otimes \mathbf{V}_{\text{group}}\)&lt;/span&gt;):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{V}
\;=\;
\bigoplus_{i=1}^{g}\mathbf{V}_{\text{group}}
\;=\;
\begin{pmatrix}
\mathbf{V}_{\text{group}} &amp;amp; \mathbf{0} &amp;amp; \cdots &amp;amp; \mathbf{0} \\
\mathbf{0} &amp;amp; \mathbf{V}_{\text{group}} &amp;amp; \cdots &amp;amp; \mathbf{0} \\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots \\
\mathbf{0} &amp;amp; \mathbf{0} &amp;amp; \cdots &amp;amp; \mathbf{V}_{\text{group}}
\end{pmatrix}
\;=\;
\mathbf{I}_g \otimes \mathbf{V}_{\text{group}} .
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Each observation has variance &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2+\sigma_e^2\)&lt;/span&gt;
(the diagonal), and any two observations in the same group have covariance
&lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt; (the off-diagonal) because they share the same group effect
&lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt; (a field or block effect, say). Dividing that shared covariance by the total
variance gives exactly the ICC. Keep the marginal distribution
&lt;span class=&#34;math inline&#34;&gt;\(\mathbf{y}\sim\mathcal N(\mathbf{X}\boldsymbol\beta,\mathbf{V})\)&lt;/span&gt; in mind: it is the
object from which both the MLE and REML objective functions are built.&lt;/p&gt;
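&lt;p&gt;As a quick sanity check on this structure, the sketch below builds &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V}\)&lt;/span&gt; two ways, once from the definition and once from the Kronecker form, and confirms that they agree. It is self-contained and plugs in the true variances from the simulation; the names &lt;code&gt;grp_demo&lt;/code&gt;, &lt;code&gt;V_direct&lt;/code&gt;, &lt;code&gt;V_group&lt;/code&gt; and &lt;code&gt;V_kron&lt;/code&gt; are illustrative and not part of the analysis code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g &amp;lt;- 10; n &amp;lt;- 8; N &amp;lt;- g * n
sa2 &amp;lt;- 4; se2 &amp;lt;- 9                              # true variances from the simulation

grp_demo &amp;lt;- factor(rep(seq_len(g), each = n))
Z_demo   &amp;lt;- model.matrix(~ grp_demo - 1)        # N x g incidence matrix

# Route 1: the definition V = sigma_a^2 ZZ&amp;#39; + sigma_e^2 I
V_direct &amp;lt;- sa2 * tcrossprod(Z_demo) + se2 * diag(N)

# Route 2: g copies of the compound-symmetry block on the diagonal
V_group &amp;lt;- se2 * diag(n) + sa2 * matrix(1, n, n)
V_kron  &amp;lt;- kronecker(diag(g), V_group)

all.equal(V_direct, V_kron, check.attributes = FALSE)  # same matrix
cov2cor(V_kron)[1, 2]       # same group: equals sa2 / (sa2 + se2), the ICC
cov2cor(V_kron)[1, n + 1]   # different groups: zero&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Converting the covariance matrix to a correlation matrix with &lt;code&gt;cov2cor()&lt;/code&gt; makes the link to the ICC explicit: every within-group off-diagonal entry is the same ratio of between-group variance to total variance.&lt;/p&gt;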
&lt;p&gt;Finally, the &lt;strong&gt;analysis-of-variance decomposition&lt;/strong&gt; underlies the historical
(Method of Moments) estimators and gives clean closed forms we can check
everything against. With &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{i\cdot}\)&lt;/span&gt; the &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;-th group mean and
&lt;span class=&#34;math inline&#34;&gt;\(\bar y_{\cdot\cdot}\)&lt;/span&gt; the grand mean,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\underbrace{\sum_{i,j}(y_{ij}-\bar y_{\cdot\cdot})^2}_{SS_{\text{total}}}
=\underbrace{n\sum_i (\bar y_{i\cdot}-\bar y_{\cdot\cdot})^2}_{SS_A \;(\text{df}=g-1)}
+\underbrace{\sum_{i,j}(y_{ij}-\bar y_{i\cdot})^2}_{SS_E \;(\text{df}=N-g)} .
\]&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ybar    &amp;lt;- mean(y)
ybar_i  &amp;lt;- tapply(y, group, mean)

SS_A &amp;lt;- n * sum((ybar_i - ybar)^2)                       # between groups
SS_E &amp;lt;- sum((y - ybar_i[as.integer(group)])^2)           # within groups
MS_A &amp;lt;- SS_A / (g - 1)                                   # between mean square
MS_E &amp;lt;- SS_E / (N - g)                                   # within  mean square

data.frame(
  Source = c(&amp;quot;Between (A)&amp;quot;, &amp;quot;Within (E)&amp;quot;),
  df     = c(g - 1, N - g),
  SS     = c(SS_A, SS_E),
  MS     = c(MS_A, MS_E)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        Source df       SS        MS
## 1 Between (A)  9 331.2220 36.802445
## 2  Within (E) 70 640.4952  9.149932&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will reuse &lt;code&gt;SS_A&lt;/code&gt;, &lt;code&gt;SS_E&lt;/code&gt;, &lt;code&gt;MS_A&lt;/code&gt;, &lt;code&gt;MS_E&lt;/code&gt; repeatedly.&lt;/p&gt;
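&lt;p&gt;If you want reassurance that the hand-rolled sums of squares are right, one option is to compare them with base R’s one-way ANOVA table. The sketch below does this on a freshly simulated dataset, so its numbers will not match the table above; the seed and every name ending in &lt;code&gt;_demo&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2026)                                  # arbitrary seed (an assumption)
g &amp;lt;- 10; n &amp;lt;- 8; N &amp;lt;- g * n
grp_demo &amp;lt;- factor(rep(seq_len(g), each = n))
y_demo   &amp;lt;- 10 + rnorm(g, 0, 2)[as.integer(grp_demo)] + rnorm(N, 0, 3)

# Hand computation, mirroring the chunk above
ybar_i_demo &amp;lt;- tapply(y_demo, grp_demo, mean)
SS_A_demo   &amp;lt;- n * sum((ybar_i_demo - mean(y_demo))^2)
SS_E_demo   &amp;lt;- sum((y_demo - ybar_i_demo[as.integer(grp_demo)])^2)

# Base R: fit the one-way layout and read off the sums of squares
tab_demo &amp;lt;- anova(lm(y_demo ~ grp_demo))
all.equal(c(SS_A_demo, SS_E_demo), tab_demo[[&amp;quot;Sum Sq&amp;quot;]])

# The decomposition itself: SS_total = SS_A + SS_E
all.equal(SS_A_demo + SS_E_demo, sum((y_demo - mean(y_demo))^2))&lt;/code&gt;&lt;/pre&gt;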
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sec-mom&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Method of Moments&lt;/h1&gt;
&lt;div id=&#34;philosophy&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Philosophy&lt;/h2&gt;
&lt;p&gt;The Method of Moments (Karl Pearson, 1894) is the oldest general-purpose
estimation principle and the most intuitive. A model with &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; parameters implies
formulas for its first &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; theoretical moments as functions of those parameters.
The data supply &lt;em&gt;empirical&lt;/em&gt; moments. &lt;strong&gt;Set the theoretical moments equal to the
empirical moments and solve.&lt;/strong&gt; No optimisation, no distributional likelihood,
just a system of equations. It requires the fewest assumptions of any method
here (you need the moments to exist, not full normality).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-estimating-equations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The estimating equations&lt;/h2&gt;
&lt;p&gt;For a generic parameter vector &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;, let &lt;span class=&#34;math inline&#34;&gt;\(\mu_r&amp;#39;(\theta) = \mathbb{E}[Y^r]\)&lt;/span&gt; denote
the &lt;span class=&#34;math inline&#34;&gt;\(r\)&lt;/span&gt;-th theoretical raw moment and let &lt;span class=&#34;math inline&#34;&gt;\(m_r&amp;#39; = \tfrac1N\sum_i y_i^r\)&lt;/span&gt; denote the
corresponding sample moment. The method of moments (MoM) estimates the unknown parameters
by matching sample moments to their theoretical counterparts. Thus, the MoM estimator
&lt;span class=&#34;math inline&#34;&gt;\(\hat\theta\)&lt;/span&gt; solves&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mu_r&amp;#39;(\hat\theta) \;=\; m_r&amp;#39;, \qquad r = 1,2,\dots,k .
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Warm-up (a single normal sample).&lt;/strong&gt; Suppose &lt;span class=&#34;math inline&#34;&gt;\(y_i\sim\mathcal N(\mu,\sigma^2)\)&lt;/span&gt;. The
first two theoretical moments are then &lt;span class=&#34;math inline&#34;&gt;\(\mu_1&amp;#39;=\mu\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\mu_2&amp;#39;=\mu^2+\sigma^2\)&lt;/span&gt;.
Equating these to the corresponding sample moments yields &lt;span class=&#34;math inline&#34;&gt;\(\hat\mu=\bar y\)&lt;/span&gt; and
&lt;span class=&#34;math inline&#34;&gt;\(\hat\sigma^2 = \frac1N\sum_i(y_i-\bar y)^2\)&lt;/span&gt;, which are simply the sample mean and
the sample variance computed using &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; in the denominator. &lt;em&gt;Keep that division by &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt;
in mind: it reappears when we discuss MLE and REML&lt;/em&gt;.&lt;/p&gt;
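&lt;p&gt;The warm-up can be checked numerically in a few lines. This is a minimal sketch on an illustrative normal sample (the seed, the sample size of 50 and the &lt;code&gt;_demo&lt;/code&gt; names are assumptions): it solves the two moment equations and confirms that the result is the usual sample variance rescaled by &lt;span class=&#34;math inline&#34;&gt;\((N-1)/N\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)                                     # arbitrary seed (an assumption)
y_demo &amp;lt;- rnorm(50, mean = 10, sd = 3)
N_demo &amp;lt;- length(y_demo)

m1 &amp;lt;- mean(y_demo)                              # first raw sample moment
m2 &amp;lt;- mean(y_demo^2)                            # second raw sample moment

mu_hat     &amp;lt;- m1                                # solves mu = m1
sigma2_hat &amp;lt;- m2 - m1^2                         # solves mu^2 + sigma^2 = m2

# var() divides by N - 1, so the MoM estimate is var() shrunk by (N - 1) / N
all.equal(sigma2_hat, (N_demo - 1) / N_demo * var(y_demo))&lt;/code&gt;&lt;/pre&gt;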
&lt;/div&gt;
&lt;div id=&#34;moment-estimators-for-the-variance-components&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Moment estimators for the variance components&lt;/h2&gt;
&lt;p&gt;Our model has a grand mean and two variance components, so we need two moment equations beyond the one for the overall mean. In a one-way layout, the natural quantities are the two &lt;em&gt;mean squares&lt;/em&gt;, whose expectations are the classical &lt;em&gt;expected mean squares&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbb{E}[MS_E] = \sigma_e^2,
\qquad
\mathbb{E}[MS_A] = \sigma_e^2 + n\,\sigma_a^2 .
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Sketch of the expected mean squares.&lt;/em&gt; Within a group, the deviations &lt;span class=&#34;math inline&#34;&gt;\(y_{ij}-\bar y_{i\cdot}\)&lt;/span&gt;
depend only on the &lt;span class=&#34;math inline&#34;&gt;\(\varepsilon\)&lt;/span&gt; terms (the shared &lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt; cancels), so the within-group sum of squares satisfies
&lt;span class=&#34;math inline&#34;&gt;\(SS_E/\sigma_e^2\sim\chi^2_{N-g}\)&lt;/span&gt;, giving &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{E}[MS_E]=\sigma_e^2\)&lt;/span&gt;. For the between-group
component, each group mean can be written as &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{i\cdot}=\mu+a_i+\bar\varepsilon_{i\cdot}\)&lt;/span&gt; with variance
&lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2+\sigma_e^2/n\)&lt;/span&gt;. The sample variance of the &lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt; group means therefore has expectation &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2+\sigma_e^2/n\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(MS_A\)&lt;/span&gt; is &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; times that sample variance, which yields &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{E}[MS_A]=\sigma_e^2+n\sigma_a^2\)&lt;/span&gt;.&lt;/p&gt;
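&lt;p&gt;The two expectations can also be checked by brute force. The following Monte Carlo sketch repeatedly simulates the balanced design with the tutorial’s true values and averages the two mean squares; the helper &lt;code&gt;ems_sim()&lt;/code&gt;, the seed and the 5000 replicates are assumptions chosen for illustration. The averages should land near &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2 + n\sigma_a^2 = 41\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2 = 9\)&lt;/span&gt;, up to simulation noise.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Simulate one balanced dataset and return its two mean squares.
ems_sim &amp;lt;- function(g, n, mu, sa2, se2) {
  a &amp;lt;- rnorm(g, 0, sqrt(sa2))
  # n x g matrix: column i holds the n observations of group i
  y &amp;lt;- matrix(mu + rep(a, each = n) + rnorm(g * n, 0, sqrt(se2)), nrow = n)
  ybar_i &amp;lt;- colMeans(y)
  c(MS_A = n * sum((ybar_i - mean(y))^2) / (g - 1),
    MS_E = sum(sweep(y, 2, ybar_i)^2) / (g * n - g))
}

set.seed(123)                                   # arbitrary seed (an assumption)
ms_demo &amp;lt;- replicate(5000, ems_sim(g = 10, n = 8, mu = 10, sa2 = 4, se2 = 9))

rowMeans(ms_demo)                 # Monte Carlo averages of MS_A and MS_E
c(MS_A = 9 + 8 * 4, MS_E = 9)     # their theoretical expectations&lt;/code&gt;&lt;/pre&gt;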
&lt;p&gt;Replacing the expectations in the two expected-mean-square identities above by their observed values and solving the
&lt;span class=&#34;math inline&#34;&gt;\(2\times2\)&lt;/span&gt; system gives the &lt;strong&gt;ANOVA / Method-of-Moments estimators&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\boxed{\;\hat\mu_{\text{MoM}} = \bar y_{\cdot\cdot},\qquad
\hat\sigma^2_{e,\text{MoM}} = MS_E,\qquad
\hat\sigma^2_{a,\text{MoM}} = \dfrac{MS_A - MS_E}{n}.\;}
\]&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mom &amp;lt;- list(
  mu       = ybar,
  sigma_e2 = MS_E,
  sigma_a2 = (MS_A - MS_E) / n
)
mom$icc &amp;lt;- mom$sigma_a2 / (mom$sigma_a2 + mom$sigma_e2)
unlist(mom)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         mu   sigma_e2   sigma_a2        icc 
## 10.1173062  9.1499321  3.4565641  0.2741891&lt;/code&gt;&lt;/pre&gt;
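&lt;p&gt;These values can be reproduced by hand from the ANOVA table printed earlier (with &lt;span class=&#34;math inline&#34;&gt;\(MS_A = 36.802445\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(MS_E = 9.149932\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(n = 8\)&lt;/span&gt;); the arithmetic below is rounded to four decimals:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat\sigma^2_{a,\text{MoM}}
= \frac{36.802445 - 9.149932}{8}
= \frac{27.652513}{8}
\approx 3.4566,
\qquad
\hat\rho_{\text{MoM}}
= \frac{3.4566}{3.4566 + 9.1499}
\approx 0.2742 .
\]&lt;/span&gt;&lt;/p&gt;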
&lt;/div&gt;
&lt;div id=&#34;visualising-the-moment-match&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualising the moment match&lt;/h2&gt;
&lt;p&gt;MoM is “solve a linear system”, so the most honest visual is the system itself:
the two observed mean squares and the two parameters that reproduce them through
the expected-mean-square equations.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-mom&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-mom-1.png&#34; alt=&#34;Method of Moments as a balance. The observed mean squares (left) are decomposed into the variance components (right) via $MS_E=\sigma_e^2$ and $MS_A=\sigma_e^2+n\sigma_a^2$.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Method of Moments as a balance. The observed mean squares (left) are decomposed into the variance components (right) via &lt;span class=&#34;math inline&#34;&gt;\(MS_E=\sigma_e^2\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(MS_A=\sigma_e^2+n\sigma_a^2\)&lt;/span&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-with-the-truth&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Comparison with the truth&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data.frame(
  parameter = c(&amp;quot;mu&amp;quot;, &amp;quot;sigma_a2&amp;quot;, &amp;quot;sigma_e2&amp;quot;, &amp;quot;ICC&amp;quot;),
  truth     = c(mu_true, sa2_true, se2_true, icc_true),
  MoM       = c(mom$mu, mom$sigma_a2, mom$sigma_e2, mom$icc)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   parameter      truth        MoM
## 1        mu 10.0000000 10.1173062
## 2  sigma_a2  4.0000000  3.4565641
## 3  sigma_e2  9.0000000  9.1499321
## 4       ICC  0.3076923  0.2741891&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;uncertainty-quantification&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Uncertainty quantification&lt;/h2&gt;
&lt;p&gt;For balanced data the relevant sums of squares have exact &lt;span class=&#34;math inline&#34;&gt;\(\chi^2\)&lt;/span&gt; sampling
distributions, which yields an &lt;strong&gt;exact&lt;/strong&gt; confidence interval for the residual
variance. Since &lt;span class=&#34;math inline&#34;&gt;\(SS_E/\sigma_e^2\sim\chi^2_{N-g}\)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\left[\;\frac{SS_E}{\chi^2_{N-g,\,1-\alpha/2}},\;\;
\frac{SS_E}{\chi^2_{N-g,\,\alpha/2}}\;\right]
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;is a &lt;span class=&#34;math inline&#34;&gt;\(100(1-\alpha)\%\)&lt;/span&gt; interval for &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;alpha &amp;lt;- 0.05
mom_ci_se2 &amp;lt;- SS_E / qchisq(c(1 - alpha/2, alpha/2), df = N - g)

# Standard error of the grand mean under the balanced one-way random-effects
# model. The mean of group means has variance (sigma_a^2 + sigma_e^2/n)/g,
# which is estimated unbiasedly by MS_A/(g*n) since E[MS_A] = n*sigma_a^2 + sigma_e^2.
se_mu_mom &amp;lt;- sqrt(MS_A / (g * n))

setNames(mom_ci_se2, c(&amp;quot;lower&amp;quot;, &amp;quot;upper&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    lower    upper 
##  6.74041 13.13633&lt;/code&gt;&lt;/pre&gt;
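&lt;p&gt;The chunk above also computes &lt;code&gt;se_mu_mom&lt;/code&gt; without printing it. A minimal sketch of how it could be turned into an interval for &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; follows, assuming normality and balance, in which case &lt;span class=&#34;math inline&#34;&gt;\((\bar y_{\cdot\cdot}-\mu)/\sqrt{MS_A/(gn)}\)&lt;/span&gt; follows a &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; distribution on &lt;span class=&#34;math inline&#34;&gt;\(g-1\)&lt;/span&gt; degrees of freedom. The inputs are copied from the printed output above rather than recomputed, and the names &lt;code&gt;ybar_obs&lt;/code&gt;, &lt;code&gt;MS_A_obs&lt;/code&gt; and &lt;code&gt;ci_mu&lt;/code&gt; are illustrative.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ybar_obs &amp;lt;- 10.1173062      # grand mean, copied from the MoM output
MS_A_obs &amp;lt;- 36.802445       # between mean square, copied from the ANOVA table
g &amp;lt;- 10; n &amp;lt;- 8

se_mu &amp;lt;- sqrt(MS_A_obs / (g * n))               # same formula as se_mu_mom

# 95% t interval on g - 1 degrees of freedom (the df attached to MS_A)
ci_mu &amp;lt;- ybar_obs + c(-1, 1) * qt(0.975, df = g - 1) * se_mu
setNames(ci_mu, c(&amp;quot;lower&amp;quot;, &amp;quot;upper&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the degrees of freedom are driven by the number of groups, not the number of observations: adding more measurements per group does little for &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; once the group effects dominate.&lt;/p&gt;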
&lt;p&gt;The &lt;strong&gt;between&lt;/strong&gt; component is harder. &lt;span class=&#34;math inline&#34;&gt;\(\hat\sigma_a^2\)&lt;/span&gt; is a &lt;em&gt;difference&lt;/em&gt; of mean
squares, its sampling distribution is not &lt;span class=&#34;math inline&#34;&gt;\(\chi^2\)&lt;/span&gt;, and, notoriously, it &lt;strong&gt;can
come out negative&lt;/strong&gt; when &lt;span class=&#34;math inline&#34;&gt;\(MS_A &amp;lt; MS_E\)&lt;/span&gt;. Approximate intervals use Satterthwaite’s
effective degrees of freedom,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\nu_{\text{eff}} = \frac{2\,(\hat\sigma_a^2)^2}{\operatorname{Var}(\hat\sigma_a^2)},
\qquad
\operatorname{Var}(\hat\sigma_a^2)\approx
\frac{2}{n^2}\!\left[\frac{MS_A^2}{g-1}+\frac{MS_E^2}{N-g}\right],
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;after which &lt;span class=&#34;math inline&#34;&gt;\(\nu_{\text{eff}}\hat\sigma_a^2/\sigma_a^2 \approx \chi^2_{\nu_{\text{eff}}}\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;var_sa2    &amp;lt;- (2 / n^2) * (MS_A^2 / (g - 1) + MS_E^2 / (N - g))
nu_eff     &amp;lt;- 2 * mom$sigma_a2^2 / var_sa2
mom_ci_sa2 &amp;lt;- nu_eff * mom$sigma_a2 / qchisq(c(1 - alpha/2, alpha/2), df = nu_eff)
c(se_sa2 = sqrt(var_sa2), nu_eff = nu_eff,
  lower = mom_ci_sa2[1], upper = mom_ci_sa2[2])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    se_sa2    nu_eff     lower     upper 
##  2.177205  5.041044  1.350738 20.576662&lt;/code&gt;&lt;/pre&gt;
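&lt;p&gt;How often should we expect a negative estimate? Under normality and balance, &lt;span class=&#34;math inline&#34;&gt;\(MS_A/(\sigma_e^2+n\sigma_a^2)\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(MS_E/\sigma_e^2\)&lt;/span&gt; are independent scaled &lt;span class=&#34;math inline&#34;&gt;\(\chi^2\)&lt;/span&gt; variables, so their ratio is &lt;span class=&#34;math inline&#34;&gt;\(F_{g-1,\,N-g}\)&lt;/span&gt; and the probability that &lt;span class=&#34;math inline&#34;&gt;\(MS_A\)&lt;/span&gt; falls below &lt;span class=&#34;math inline&#34;&gt;\(MS_E\)&lt;/span&gt; is an &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; tail area. The helper &lt;code&gt;p_negative()&lt;/code&gt; below is a hypothetical convenience function written for this illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# P(MS_A below MS_E) = P(F below sigma_e^2 / (sigma_e^2 + n * sigma_a^2))
p_negative &amp;lt;- function(sa2, se2, g, n) {
  pf(se2 / (se2 + n * sa2), df1 = g - 1, df2 = g * n - g)
}

p_negative(sa2 = 4,   se2 = 9, g = 10, n = 8)   # the design used in this tutorial
p_negative(sa2 = 0.5, se2 = 9, g = 10, n = 8)   # a much weaker group effect
p_negative(sa2 = 0,   se2 = 9, g = 10, n = 8)   # no group effect at all&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The probability grows as the true between-group variance shrinks relative to the residual variance, which is why negative ANOVA estimates are mostly a problem for weak group effects or small designs.&lt;/p&gt;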
&lt;/div&gt;
&lt;div id=&#34;strengths-weaknesses-assumptions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strengths, weaknesses, assumptions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strengths.&lt;/strong&gt; Distribution-light (needs only that moments exist), closed-form
and instantaneous, an excellent &lt;em&gt;starting value&lt;/em&gt; for iterative methods, and the
conceptual root of GMM. For balanced data under normality, the ANOVA estimators are also minimum-variance unbiased.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaknesses.&lt;/strong&gt; Can produce nonsensical estimates (negative variances); usually
less efficient than MLE for non-normal/unbalanced data; uncertainty
quantification is awkward and often only approximate; the choice of &lt;em&gt;which&lt;/em&gt;
moments to match is not unique and affects efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assumptions used here.&lt;/strong&gt; The first two moments are correctly specified
(so the expected mean squares hold) and the design is balanced (so the mean squares are independent and
the closed forms are clean).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typical applications.&lt;/strong&gt; Quick variance-component estimates;
any setting where a full likelihood is unavailable or undesirable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sec-mle&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Maximum Likelihood Estimation&lt;/h1&gt;
&lt;div id=&#34;philosophy-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Philosophy&lt;/h2&gt;
&lt;p&gt;Fisher’s principle (1922) inverts the usual probabilistic question. Instead of
asking “given parameters, how probable is the data?”, it asks “which parameters
make the observed data &lt;em&gt;most&lt;/em&gt; probable?” The probability of the data, &lt;em&gt;read as a
function of the parameters with the data held fixed&lt;/em&gt;, is the &lt;strong&gt;likelihood&lt;/strong&gt;. The
MLE is its maximiser.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-likelihood-and-log-likelihood&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The likelihood and log-likelihood&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;likelihood&lt;/strong&gt; scores a guess: choose values for the parameters
&lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\psi = (\boldsymbol\beta,\sigma_a^2,\sigma_e^2)\)&lt;/span&gt;, and it measures how
well those values explain the data you collected, with higher meaning a better
fit. Using the marginal model &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{y}\sim\mathcal N(\mathbf{X}\boldsymbol\beta,\mathbf{V})\)&lt;/span&gt;, that score is the multivariate
normal density evaluated at the data:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
L(\boldsymbol\psi)
= (2\pi)^{-N/2}\,|\mathbf{V}|^{-1/2}
\exp\!\Big\{\!-\tfrac12 (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^\top
\mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol\beta)\Big\}.
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;You never work this out by hand; the computer does. Because it is a joint
density over all &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; observations, the result is typically a vanishingly small number, so
we work on the log scale instead. The log does not change &lt;em&gt;which&lt;/em&gt; guess wins, it
just keeps the arithmetic stable:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\ell(\boldsymbol\psi)
= -\frac{N}{2}\log(2\pi)
  -\frac12\log|\mathbf{V}|
  -\frac12(\mathbf{y}-\mathbf{X}\boldsymbol\beta)^\top
   \mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol\beta).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This log-likelihood &lt;span class=&#34;math inline&#34;&gt;\(\ell(\boldsymbol\psi)\)&lt;/span&gt; is the &lt;strong&gt;objective function&lt;/strong&gt;: the quantity we make as large as
possible, pictured in the next section as climbing to the top of a hill.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In plain terms:&lt;/em&gt; the score rises when the model’s predictions sit close to the
observed values (the last term), but it is penalised for explaining that
closeness by simply assuming the data are noisier than they really are (the
&lt;span class=&#34;math inline&#34;&gt;\(\log|\mathbf{V}|\)&lt;/span&gt; term). Balancing those two is what picks out a single best
answer.&lt;/p&gt;
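&lt;p&gt;To make the two competing terms concrete, the sketch below evaluates the log-likelihood at one guess in two ways: directly from the matrix formula, and through a shortcut for the balanced design in which the determinant and the quadratic form reduce to sums of squares. The shortcut is stated here without derivation and checked numerically; the seed, the data and the function names &lt;code&gt;loglik_matrix()&lt;/code&gt; and &lt;code&gt;loglik_closed()&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)                                    # arbitrary seed (an assumption)
g &amp;lt;- 10; n &amp;lt;- 8; N &amp;lt;- g * n
grp_demo &amp;lt;- factor(rep(seq_len(g), each = n))
y_demo   &amp;lt;- 10 + rnorm(g, 0, 2)[as.integer(grp_demo)] + rnorm(N, 0, 3)
Z_demo   &amp;lt;- model.matrix(~ grp_demo - 1)

# Matrix form: the multivariate normal log-density evaluated at the data.
loglik_matrix &amp;lt;- function(mu, sa2, se2) {
  V &amp;lt;- sa2 * tcrossprod(Z_demo) + se2 * diag(N)
  r &amp;lt;- y_demo - mu
  logdetV &amp;lt;- as.numeric(determinant(V, logarithm = TRUE)$modulus)
  -0.5 * (N * log(2 * pi) + logdetV + sum(r * solve(V, r)))
}

# Balanced-design shortcut written in terms of sums of squares.
loglik_closed &amp;lt;- function(mu, sa2, se2) {
  ybar_i &amp;lt;- tapply(y_demo, grp_demo, mean)
  SS_E   &amp;lt;- sum((y_demo - ybar_i[as.integer(grp_demo)])^2)
  lam    &amp;lt;- se2 + n * sa2                       # n times the variance of a group mean
  -0.5 * (N * log(2 * pi) +
          g * (n - 1) * log(se2) + g * log(lam) +         # log-determinant penalty
          SS_E / se2 + n * sum((ybar_i - mu)^2) / lam)    # lack-of-fit term
}

all.equal(loglik_matrix(10, 4, 9), loglik_closed(10, 4, 9))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the shortcut, inflating either variance shrinks the lack-of-fit term but raises the log-determinant penalty, which is the trade-off described in plain terms above.&lt;/p&gt;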
&lt;/div&gt;
&lt;div id=&#34;profiling-out-the-fixed-effect&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Profiling out the fixed effect&lt;/h2&gt;
&lt;p&gt;There is a useful shortcut here. Suppose, for a moment, that we already knew the
two variances (and therefore &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V}\)&lt;/span&gt;). Then the best grand mean has a tidy
formula: the value of &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\beta\)&lt;/span&gt; that maximises the log-likelihood &lt;span class=&#34;math inline&#34;&gt;\(\ell(\boldsymbol\psi)\)&lt;/span&gt; is the
&lt;strong&gt;generalised least squares (GLS)&lt;/strong&gt; estimate, found by setting the derivative of
the log-likelihood to zero, &lt;span class=&#34;math inline&#34;&gt;\(\partial\ell/\partial\boldsymbol\beta=0\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat{\boldsymbol\beta}(\mathbf{V})
= \big(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}\big)^{-1}
  \mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y}.
\]&lt;/span&gt;&lt;/p&gt;
&lt;div class=&#34;callout-note&#34; title=&#34;Connection to ordinary least squares&#34;&gt;
&lt;p&gt;You have likely met the simpler special case. When every observation is
independent and has the same variance, so that &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V} = \sigma^2\mathbf{I}\)&lt;/span&gt;,
the formula above collapses to the familiar &lt;strong&gt;ordinary least squares (OLS)&lt;/strong&gt;
estimate &lt;span class=&#34;math inline&#34;&gt;\(\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\)&lt;/span&gt;.
GLS is just OLS with the observations re-weighted by &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{V}^{-1}\)&lt;/span&gt;, so that
correlated or noisier observations count for less. For our balanced design the
two even give the same number, the grand mean &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{\cdot\cdot}\)&lt;/span&gt;; they part
ways once the groups become unbalanced.&lt;/p&gt;
&lt;/div&gt;
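&lt;p&gt;A minimal sketch of that collapse on simulated toy data (not the tutorial’s data set). The names &lt;code&gt;X_toy&lt;/code&gt;, &lt;code&gt;y_toy&lt;/code&gt; and &lt;code&gt;gls_beta()&lt;/code&gt; are illustrative assumptions, not objects defined elsewhere in this tutorial:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)                               # toy data, for illustration only
n_toy &amp;lt;- 8
X_toy &amp;lt;- cbind(1, rnorm(n_toy))            # intercept plus one covariate
y_toy &amp;lt;- 2 + 0.5 * X_toy[, 2] + rnorm(n_toy)

# GLS estimate for a given covariance matrix V (hypothetical helper)
gls_beta &amp;lt;- function(X, y, V) {
  Vinv &amp;lt;- solve(V)
  solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
}

beta_ols &amp;lt;- solve(crossprod(X_toy), crossprod(X_toy, y_toy))
beta_gls &amp;lt;- gls_beta(X_toy, y_toy, V = 3 * diag(n_toy))   # V = sigma^2 I with sigma^2 = 3

# The common factor sigma^2 cancels, so the two estimates agree
all.equal(as.numeric(beta_ols), as.numeric(beta_gls))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replacing &lt;code&gt;3 * diag(n_toy)&lt;/code&gt; with a covariance matrix that has unequal variances or non-zero off-diagonal entries will generally pull the two estimates apart.&lt;/p&gt;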
&lt;p&gt;Because the mean now has a formula written in terms of the variances, we can
substitute &lt;span class=&#34;citation&#34;&gt;@eq-gls&lt;/span&gt; back into &lt;span class=&#34;citation&#34;&gt;@eq-loglik&lt;/span&gt; and be left with a function of the two
variances alone, the &lt;strong&gt;profile log-likelihood&lt;/strong&gt; &lt;span class=&#34;math inline&#34;&gt;\(\ell_p(\sigma_a^2,\sigma_e^2)\)&lt;/span&gt;.
That is the quantity the computer actually maximises. One practical trick: we
search over the &lt;em&gt;logs&lt;/em&gt; of the variances, &lt;span class=&#34;math inline&#34;&gt;\(\theta_1=\log\sigma_a^2\)&lt;/span&gt; and
&lt;span class=&#34;math inline&#34;&gt;\(\theta_2=\log\sigma_e^2\)&lt;/span&gt;, so the optimiser can roam freely while the variances
themselves stay positive.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;X &amp;lt;- matrix(1, N, 1)                     # single fixed effect (intercept = mu)
Z &amp;lt;- model.matrix(~ group - 1)           # N x g group-incidence matrix

make_V &amp;lt;- function(sa2, se2) sa2 * tcrossprod(Z) + se2 * diag(N)

# Negative *profile* log-likelihood (so we can MINIMISE with optim()).
neg_loglik &amp;lt;- function(theta) {
  sa2 &amp;lt;- exp(theta[1]); se2 &amp;lt;- exp(theta[2])
  V   &amp;lt;- make_V(sa2, se2)
  R   &amp;lt;- chol(V)                         # V = R&amp;#39;R, R upper-triangular
  logdetV &amp;lt;- 2 * sum(log(diag(R)))       # stable log-determinant
  Vinv    &amp;lt;- chol2inv(R)
  XtVi    &amp;lt;- crossprod(X, Vinv)
  beta    &amp;lt;- solve(XtVi %*% X, XtVi %*% y)   # GLS estimate, eq. (GLS)
  r       &amp;lt;- y - X %*% beta
  quad    &amp;lt;- as.numeric(crossprod(r, Vinv %*% r))
  0.5 * (N * log(2*pi) + logdetV + quad)     # = -ell_p
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;start &amp;lt;- log(c(var(y) / 2, var(y) / 2))         # crude starting values
fit_ml &amp;lt;- optim(start, neg_loglik, method = &amp;quot;BFGS&amp;quot;, hessian = TRUE)

mle &amp;lt;- list(
  sigma_a2 = exp(fit_ml$par[1]),
  sigma_e2 = exp(fit_ml$par[2]),
  loglik   = -fit_ml$value
)
# Recover the fixed effect at the optimum via GLS:
Vhat &amp;lt;- make_V(mle$sigma_a2, mle$sigma_e2)
Vinv &amp;lt;- solve(Vhat)
mle$mu  &amp;lt;- as.numeric(solve(crossprod(X, Vinv) %*% X, crossprod(X, Vinv) %*% y))
mle$icc &amp;lt;- mle$sigma_a2 / (mle$sigma_a2 + mle$sigma_e2)
unlist(mle[c(&amp;quot;mu&amp;quot;,&amp;quot;sigma_a2&amp;quot;,&amp;quot;sigma_e2&amp;quot;,&amp;quot;icc&amp;quot;,&amp;quot;loglik&amp;quot;)])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           mu     sigma_a2     sigma_e2          icc       loglik 
##   10.1173062    2.9965344    9.1499324    0.2467001 -208.4972277&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;closed-forms-confirm-the-optimiser&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Closed forms confirm the optimiser&lt;/h2&gt;
&lt;p&gt;For this balanced design the MLE has a closed form, and comparing it to &lt;code&gt;optim()&lt;/code&gt;
reassures us the numerics are right, &lt;em&gt;and&lt;/em&gt; exposes the bias that motivates the
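next section. The likelihood factorises (because &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{\cdot\cdot}\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(SS_A\)&lt;/span&gt; and
&lt;span class=&#34;math inline&#34;&gt;\(SS_E\)&lt;/span&gt; are mutually independent), giving, whenever &lt;span class=&#34;math inline&#34;&gt;\(SS_A/g \ge MS_E\)&lt;/span&gt; so that
the between-group variance estimate is not pinned at the zero boundary,&lt;/p&gt;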
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat\mu_{\text{ML}} = \bar y_{\cdot\cdot},\qquad
\hat\sigma^2_{e,\text{ML}} = MS_E,\qquad
\hat\sigma^2_{a,\text{ML}} = \frac{1}{n}\!\left(\frac{SS_A}{g} - MS_E\right).
\]&lt;/span&gt; {#eq-mle-closed}&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mle_closed &amp;lt;- c(
  mu       = ybar,
  sigma_e2 = MS_E,
  sigma_a2 = ((SS_A / g) - MS_E) / n     # NOTE the divisor g, not g-1
)
rbind(closed_form = mle_closed,
      optim       = c(mle$mu, mle$sigma_e2, mle$sigma_a2))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                   mu sigma_e2 sigma_a2
## closed_form 10.11731 9.149932 2.996534
## optim       10.11731 9.149932 2.996534&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compare &lt;span class=&#34;citation&#34;&gt;@eq-mle-closed&lt;/span&gt; with the MoM estimator &lt;span class=&#34;citation&#34;&gt;@eq-mom-est&lt;/span&gt;. They differ in
one place: ML divides the between sum of squares &lt;span class=&#34;math inline&#34;&gt;\(SS_A\)&lt;/span&gt; by &lt;strong&gt;&lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt;&lt;/strong&gt;, the number of
groups; MoM (and, as we will see, REML) divides by &lt;strong&gt;&lt;span class=&#34;math inline&#34;&gt;\(g-1\)&lt;/span&gt;&lt;/strong&gt;. That single
degree of freedom, the one spent estimating &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;, is the entire story of the
next section. Because &lt;span class=&#34;math inline&#34;&gt;\(g &amp;gt; g-1\)&lt;/span&gt;, &lt;strong&gt;ML systematically &lt;em&gt;underestimates&lt;/em&gt;
&lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;.&lt;/strong&gt;&lt;/p&gt;
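&lt;p&gt;The gap between the two divisors can be written down exactly: subtracting the ML formula from the MoM one leaves &lt;span class=&#34;math inline&#34;&gt;\(SS_A/\{n\,g\,(g-1)\}\)&lt;/span&gt;, which is positive whenever the group means differ at all. A small sketch on a simulated balanced design; the values of &lt;code&gt;g_toy&lt;/code&gt;, &lt;code&gt;n_toy&lt;/code&gt; and the simulation variances are assumptions chosen only for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(7)                                # toy balanced design, illustration only
g_toy &amp;lt;- 6; n_toy &amp;lt;- 5
idx   &amp;lt;- rep(seq_len(g_toy), each = n_toy) # group index for each observation
a_toy &amp;lt;- rnorm(g_toy, sd = 2)              # simulated group effects
y_toy &amp;lt;- 10 + a_toy[idx] + rnorm(g_toy * n_toy, sd = 3)

grp_means &amp;lt;- as.numeric(tapply(y_toy, idx, mean))
SS_A_toy  &amp;lt;- n_toy * sum((grp_means - mean(y_toy))^2)   # between-group SS
SS_E_toy  &amp;lt;- sum((y_toy - grp_means[idx])^2)            # within-group SS
MS_E_toy  &amp;lt;- SS_E_toy / (g_toy * (n_toy - 1))

sa2_ml  &amp;lt;- (SS_A_toy / g_toy       - MS_E_toy) / n_toy  # divisor g
sa2_mom &amp;lt;- (SS_A_toy / (g_toy - 1) - MS_E_toy) / n_toy  # divisor g - 1

# The difference matches SS_A / (n g (g - 1)) exactly
gap &amp;lt;- SS_A_toy / (n_toy * g_toy * (g_toy - 1))
all.equal(sa2_mom - sa2_ml, gap)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the gap shrinks like &lt;span class=&#34;math inline&#34;&gt;\(1/g\)&lt;/span&gt;, the two estimators agree closely when there are many groups and differ most when there are few.&lt;/p&gt;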
&lt;/div&gt;
&lt;div id=&#34;visualising-the-estimation-the-likelihood-surface&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualising the estimation: the likelihood surface&lt;/h2&gt;
&lt;p&gt;MLE is hill-climbing on &lt;span class=&#34;citation&#34;&gt;@eq-loglik&lt;/span&gt;. Let us draw the hill (over the two variance
components, with &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; profiled out) and mark where &lt;code&gt;optim()&lt;/code&gt; stopped.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sa2_grid &amp;lt;- seq(0.2, 12, length.out = 70)
se2_grid &amp;lt;- seq(5,   14, length.out = 70)
grid &amp;lt;- expand.grid(sa2 = sa2_grid, se2 = se2_grid)
grid$ll &amp;lt;- apply(grid, 1, function(r) -neg_loglik(log(c(r[&amp;quot;sa2&amp;quot;], r[&amp;quot;se2&amp;quot;]))))
grid$rel &amp;lt;- grid$ll - max(grid$ll)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-mle-surface&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-mle-surface-1.png&#34; alt=&#34;The profile log-likelihood surface over the two variance components, with $\mu$ concentrated out. The orange dot is the MLE; the black cross is the true $(\sigma_a^2,\sigma_e^2)$. Contours are log-likelihood units below the maximum.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: The profile log-likelihood surface over the two variance components, with &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; concentrated out. The orange dot is the MLE; the black cross is the true &lt;span class=&#34;math inline&#34;&gt;\((\sigma_a^2,\sigma_e^2)\)&lt;/span&gt;. Contours are log-likelihood units below the maximum.
&lt;/p&gt;
&lt;/div&gt;
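&lt;p&gt;A one-dimensional &lt;strong&gt;profile&lt;/strong&gt; for &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; makes the curvature-uncertainty link
concrete: how sharply the log-likelihood falls away from its peak &lt;em&gt;is&lt;/em&gt; the
information about &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;. (Strictly, the curve computed here is a slice with the
variance components held at their MLEs, rather than re-maximised at each value of &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;.)&lt;/p&gt;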
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mu_grid &amp;lt;- seq(mle$mu - 3, mle$mu + 3, length.out = 200)
ll_mu &amp;lt;- sapply(mu_grid, function(m) {
  r &amp;lt;- y - m
  -0.5 * (N * log(2*pi) + as.numeric(determinant(Vhat, TRUE)$modulus) +
          as.numeric(crossprod(r, Vinv %*% r)))
})
prof_mu &amp;lt;- data.frame(mu = mu_grid, rel = ll_mu - max(ll_mu))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-mle-mu-profile&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-mle-mu-profile-1.png&#34; alt=&#34;Profile log-likelihood for $\mu$ (variance components held at their MLEs). Curvature at the peak equals the Fisher information; flatter peaks mean larger standard errors. The horizontal line at $-1.92$ cuts a 95% likelihood interval.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Profile log-likelihood for &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; (variance components held at their MLEs). Curvature at the peak equals the Fisher information; flatter peaks mean larger standard errors. The horizontal line at &lt;span class=&#34;math inline&#34;&gt;\(-1.92\)&lt;/span&gt; cuts a 95% likelihood interval.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-with-the-truth-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Comparison with the truth&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;##   parameter      truth        MoM        MLE
## 1        mu 10.0000000 10.1173062 10.1173062
## 2  sigma_a2  4.0000000  3.4565641  2.9965344
## 3  sigma_e2  9.0000000  9.1499321  9.1499324
## 4       ICC  0.3076923  0.2741891  0.2467001&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The MLE of &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt; sits below both the truth and the MoM estimate: the
expected downward bias.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;uncertainty-quantification-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Uncertainty quantification&lt;/h2&gt;
&lt;p&gt;How confident should we be in an MLE? The key idea is intuitive: the &lt;strong&gt;sharper
the peak&lt;/strong&gt; of the log-likelihood, the more precisely the data pin down the
estimate. A narrow, steep peak means only a small range of values fit well, so
the standard error is small; a broad, flat peak means many values fit almost
equally well, so the estimate is uncertain.&lt;/p&gt;
&lt;p&gt;A large-sample result makes this precise. With enough data, an MLE behaves like
a draw from a normal distribution centred on the true value, with a spread set by
the curvature of that peak:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat{\boldsymbol\theta}\;\dot\sim\;
\mathcal N\!\big(\boldsymbol\theta,\; \mathcal I(\boldsymbol\theta)^{-1}\big),
\qquad
\mathcal I(\boldsymbol\theta)
= -\,\mathbb{E}\!\left[\frac{\partial^2 \ell}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^\top}\right].
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The matrix &lt;span class=&#34;math inline&#34;&gt;\(\mathcal I(\boldsymbol\theta)\)&lt;/span&gt;, the &lt;strong&gt;Fisher information&lt;/strong&gt;, is just a
measure of that curvature, built from the second derivatives of &lt;span class=&#34;math inline&#34;&gt;\(\ell\)&lt;/span&gt;.
Inverting it turns “how curved” into “how variable,” which gives the variances
and standard errors of the estimates. In practice we read the curvature straight
off the fitted model: the &lt;strong&gt;observed information&lt;/strong&gt; is the Hessian (the matrix of
second derivatives) of &lt;span class=&#34;math inline&#34;&gt;\(-\ell\)&lt;/span&gt; at the optimum, which &lt;code&gt;optim()&lt;/code&gt; already returns.&lt;/p&gt;
&lt;p&gt;One last step. Because we maximised on the log-variance scale, the standard
errors arrive on that scale too, and the &lt;strong&gt;delta method&lt;/strong&gt; converts them back. If
&lt;span class=&#34;math inline&#34;&gt;\(\sigma^2 = e^{\theta}\)&lt;/span&gt;, then
&lt;span class=&#34;math inline&#34;&gt;\(\widehat{\mathrm{SE}}(\sigma^2) = \sigma^2\,\widehat{\mathrm{SE}}(\theta)\)&lt;/span&gt;. This
is just the chain rule: a small wobble in &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; shows up as a wobble in
&lt;span class=&#34;math inline&#34;&gt;\(\sigma^2\)&lt;/span&gt; scaled by the derivative &lt;span class=&#34;math inline&#34;&gt;\(e^\theta = \sigma^2\)&lt;/span&gt;.&lt;/p&gt;
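&lt;p&gt;Before applying this to the mixed model, the rule can be checked on a toy problem where the answer is known. The sketch below assumes a single normal sample with mean known to be zero (an assumption made only to keep the example one-dimensional), fits &lt;span class=&#34;math inline&#34;&gt;\(\theta=\log\sigma^2\)&lt;/span&gt; with &lt;code&gt;optim()&lt;/code&gt;, and compares the delta-method standard error with the textbook large-sample value &lt;span class=&#34;math inline&#34;&gt;\(\hat\sigma^2\sqrt{2/n}\)&lt;/span&gt;. All object names are hypothetical:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(11)                               # toy sample, illustration only
y_toy &amp;lt;- rnorm(50, mean = 0, sd = 2)       # mean treated as known (zero)
n_toy &amp;lt;- length(y_toy)

# Negative log-likelihood as a function of theta = log(sigma^2)
negll_toy &amp;lt;- function(theta) {
  0.5 * (n_toy * log(2 * pi) + n_toy * theta + sum(y_toy^2) / exp(theta))
}

fit_toy  &amp;lt;- optim(0, negll_toy, method = &amp;quot;BFGS&amp;quot;, hessian = TRUE)
s2_hat   &amp;lt;- exp(fit_toy$par)                 # back to the variance scale
se_theta &amp;lt;- sqrt(1 / fit_toy$hessian[1, 1])  # SE on the log scale (inverse curvature)
se_s2    &amp;lt;- s2_hat * se_theta                # delta method

# Delta-method SE next to the textbook large-sample SE of a variance
c(delta = se_s2, textbook = s2_hat * sqrt(2 / n_toy))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two numbers should agree up to the accuracy of the numerical Hessian, because the curvature of this objective at its optimum is exactly &lt;span class=&#34;math inline&#34;&gt;\(n/2\)&lt;/span&gt;.&lt;/p&gt;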
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cov_theta &amp;lt;- solve(fit_ml$hessian)          # covariance of (log sa2, log se2)
se_theta  &amp;lt;- sqrt(diag(cov_theta))

se_sa2 &amp;lt;- mle$sigma_a2 * se_theta[1]        # delta method
se_se2 &amp;lt;- mle$sigma_e2 * se_theta[2]

# Wald 95% CIs are cleanest on the log scale, then back-transformed (keeps them &amp;gt; 0):
ci_sa2 &amp;lt;- exp(fit_ml$par[1] + c(-1, 1) * 1.96 * se_theta[1])
ci_se2 &amp;lt;- exp(fit_ml$par[2] + c(-1, 1) * 1.96 * se_theta[2])

# SE of the GLS mean: Var(mu_hat) = (X&amp;#39; Vinv X)^{-1}
se_mu  &amp;lt;- sqrt(as.numeric(solve(crossprod(X, Vinv) %*% X)))
ci_mu  &amp;lt;- mle$mu + c(-1, 1) * 1.96 * se_mu

data.frame(
  parameter = c(&amp;quot;mu&amp;quot;, &amp;quot;sigma_a2&amp;quot;, &amp;quot;sigma_e2&amp;quot;),
  estimate  = c(mle$mu, mle$sigma_a2, mle$sigma_e2),
  SE        = c(se_mu, se_sa2, se_se2),
  lower95   = c(ci_mu[1], ci_sa2[1], ci_se2[1]),
  upper95   = c(ci_mu[2], ci_sa2[2], ci_se2[2])
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   parameter  estimate        SE   lower95  upper95
## 1        mu 10.117306 0.6434498 8.8561446 11.37847
## 2  sigma_a2  2.996534 1.8616536 0.8867159 10.12638
## 3  sigma_e2  9.149932 1.5466206 6.5695549 12.74383&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Wald interval for a variance can misbehave near zero. The log-scale
construction above avoids negative limits, and a profile-likelihood interval,
obtained by inverting a curve like the one in &lt;span class=&#34;citation&#34;&gt;@fig-mle-mu-profile&lt;/span&gt;, is better still when samples are
small.&lt;/p&gt;
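&lt;p&gt;To show what inverting a likelihood curve involves, here is a minimal sketch for the variance of a single normal sample, a deliberately simpler setting than the mixed model. The mean is profiled out at &lt;span class=&#34;math inline&#34;&gt;\(\bar y\)&lt;/span&gt;, and the interval collects every &lt;span class=&#34;math inline&#34;&gt;\(\sigma^2\)&lt;/span&gt; whose profile log-likelihood lies within &lt;code&gt;qchisq(0.95, 1) / 2&lt;/code&gt; (about 1.92) of the maximum. The data and object names are assumptions for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)                                # toy sample, illustration only
y_toy &amp;lt;- rnorm(20, mean = 10, sd = 3)
n_toy &amp;lt;- length(y_toy)
ss    &amp;lt;- sum((y_toy - mean(y_toy))^2)      # mu profiled out at the sample mean

# Profile log-likelihood for sigma^2
prof_ll &amp;lt;- function(s2) -0.5 * (n_toy * log(2 * pi * s2) + ss / s2)

s2_hat &amp;lt;- ss / n_toy                       # ML estimate of sigma^2
cutoff &amp;lt;- prof_ll(s2_hat) - qchisq(0.95, 1) / 2

# Find where the curve crosses the cutoff on each side of the peak
lower &amp;lt;- uniroot(function(s2) prof_ll(s2) - cutoff,
                 c(s2_hat / 100, s2_hat), tol = 1e-8)$root
upper &amp;lt;- uniroot(function(s2) prof_ll(s2) - cutoff,
                 c(s2_hat, s2_hat * 100), tol = 1e-8)$root
c(lower = lower, estimate = s2_hat, upper = upper)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike a Wald interval, the result is not forced to be symmetric around the estimate; it follows the shape of the likelihood itself.&lt;/p&gt;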
&lt;/div&gt;
&lt;div id=&#34;strengths-weaknesses-assumptions-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strengths, weaknesses, assumptions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strengths.&lt;/strong&gt; Asymptotically efficient (attains the Cramér–Rao bound),
invariant to reparameterisation, a single coherent framework for point
estimates &lt;em&gt;and&lt;/em&gt; uncertainty, and the basis for likelihood-ratio tests and
information criteria (AIC).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaknesses.&lt;/strong&gt; Variance components are &lt;strong&gt;biased in finite samples&lt;/strong&gt; (the
&lt;span class=&#34;math inline&#34;&gt;\(\div g\)&lt;/span&gt; problem); relies on asymptotics that can be poor with few groups; the
full distribution must be specified correctly; the optimisation can be
multi-modal or land on a boundary (&lt;span class=&#34;math inline&#34;&gt;\(\hat\sigma_a^2 = 0\)&lt;/span&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assumptions used here.&lt;/strong&gt; Correct normal model &lt;span class=&#34;citation&#34;&gt;@eq-marginal&lt;/span&gt;; large-sample
regularity for the information-based standard errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typical applications.&lt;/strong&gt; The default for GLMs, survival models, and
structural equation models, and the engine inside almost every modern estimation
routine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sec-reml&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Restricted Maximum Likelihood&lt;/h1&gt;
&lt;div id=&#34;philosophy-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Philosophy&lt;/h2&gt;
&lt;p&gt;REML (Patterson &amp;amp; Thompson, 1971) exists to fix one specific flaw: ordinary ML
estimates variance components &lt;strong&gt;as if the fixed effects were known&lt;/strong&gt;, when in
fact they were estimated from the same data. Estimating &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; fixed effects uses up
&lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; degrees of freedom that ML never accounts for, so its variance estimates are
biased downward. REML’s remedy is elegant: &lt;strong&gt;estimate the variance components
from a part of the data that carries no information about the fixed effects&lt;/strong&gt;,
the &lt;em&gt;error contrasts&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-cleanest-possible-illustration-n-versus-n-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The cleanest possible illustration (&lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; versus &lt;span class=&#34;math inline&#34;&gt;\(n-1\)&lt;/span&gt;)&lt;/h2&gt;
&lt;p&gt;Take the simplest possible case: a single normal sample
&lt;span class=&#34;math inline&#34;&gt;\(y_i\sim\mathcal N(\mu,\sigma^2)\)&lt;/span&gt;. The MLE of the variance,
&lt;span class=&#34;math inline&#34;&gt;\(\hat\sigma^2_{\text{ML}} = \frac1n\sum(y_i-\bar y)^2\)&lt;/span&gt;, comes out too small,
because we used the same data to compute &lt;span class=&#34;math inline&#34;&gt;\(\bar y\)&lt;/span&gt; (our estimate of &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;) and then measured the spread
around it. The fix is to work with the deviations &lt;span class=&#34;math inline&#34;&gt;\(y_i - \bar y\)&lt;/span&gt; instead. The deviations
sum to zero, so only &lt;span class=&#34;math inline&#34;&gt;\(n-1\)&lt;/span&gt; of them are free; these &lt;span class=&#34;math inline&#34;&gt;\(n-1\)&lt;/span&gt; &lt;strong&gt;contrasts&lt;/strong&gt; carry &lt;em&gt;no information about &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;&lt;/em&gt;, so
building the likelihood from them drops &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; out of the problem entirely and
gives&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat\sigma^2_{\text{REML}} = \frac{1}{n-1}\sum_i (y_i-\bar y)^2 ,
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;the &lt;strong&gt;unbiased&lt;/strong&gt; sample variance. Estimating the one
mean &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; cost a single degree of freedom, and REML hands it back. &lt;em&gt;That is the
whole idea.&lt;/em&gt; Everything below is the same idea written in matrix form.&lt;/p&gt;
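&lt;p&gt;A quick numerical check of the two divisors, on a simulated toy sample (the sample size and parameter values are assumptions for illustration only):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(3)                                # toy sample, illustration only
y_toy &amp;lt;- rnorm(10, mean = 5, sd = 2)
n_toy &amp;lt;- length(y_toy)

ss      &amp;lt;- sum((y_toy - mean(y_toy))^2)    # sum of squared deviations
s2_ml   &amp;lt;- ss / n_toy                      # ML: divides by n
s2_reml &amp;lt;- ss / (n_toy - 1)                # REML: divides by n - 1

all.equal(s2_reml, var(y_toy))             # var() also uses the n - 1 divisor
s2_ml / s2_reml                            # the ratio is (n - 1) / n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The ML estimate is always the smaller of the two by the factor &lt;span class=&#34;math inline&#34;&gt;\((n-1)/n\)&lt;/span&gt;, so the shortfall matters most in small samples.&lt;/p&gt;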
&lt;/div&gt;
&lt;div id=&#34;the-restricted-likelihood&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The restricted likelihood&lt;/h2&gt;
&lt;p&gt;Rather than analysing the raw responses, REML analyses &lt;strong&gt;error contrasts&lt;/strong&gt;:
combinations of the data deliberately built to contain no trace of the fixed
effects. Formally, pick any full-rank matrix
&lt;span class=&#34;math inline&#34;&gt;\(\mathbf{K}_{N\times(N-p)}\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{K}^\top\mathbf{X}=\mathbf 0\)&lt;/span&gt;; the transformed data
&lt;span class=&#34;math inline&#34;&gt;\(\mathbf{K}^\top\mathbf{y}\)&lt;/span&gt; are those contrasts. (Full rank here means that the
&lt;span class=&#34;math inline&#34;&gt;\(N-p\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{K}\)&lt;/span&gt; are linearly independent, so no contrast is
redundant given the others.) From &lt;span class=&#34;citation&#34;&gt;@eq-marginal&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbf{K}^\top\mathbf{y} \;\sim\;
\mathcal N\!\big(\mathbf 0,\; \mathbf{K}^\top\mathbf{V}\mathbf{K}\big),
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;a distribution that &lt;strong&gt;does not involve &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\beta\)&lt;/span&gt; at all&lt;/strong&gt;. Its
log-likelihood (which, remarkably, does not depend on which &lt;span class=&#34;math inline&#34;&gt;\(\mathbf K\)&lt;/span&gt; you pick)
is the &lt;strong&gt;restricted log-likelihood&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\ell_R(\boldsymbol\theta)
= -\frac{N-p}{2}\log(2\pi)
  -\frac12\log|\mathbf{V}|
  \;\underbrace{-\frac12\log\big|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}\big|}_{\text{the degrees-of-freedom penalty}}
  -\frac12(\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta})^\top
   \mathbf{V}^{-1}(\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta}).
\]&lt;/span&gt; {#eq-reml}&lt;/p&gt;
&lt;p&gt;Lined up against the profile ML objective, REML adds exactly one term that depends on the variances,
&lt;span class=&#34;math inline&#34;&gt;\(-\tfrac12\log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}|\)&lt;/span&gt; (the constant also changes, from &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; to
&lt;span class=&#34;math inline&#34;&gt;\(N-p\)&lt;/span&gt;, but that does not move the optimum). That term is the price
of estimating &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\beta\)&lt;/span&gt;: the matrix version of dividing by &lt;span class=&#34;math inline&#34;&gt;\(n-1\)&lt;/span&gt; instead
of &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, and its influence grows with the number of fixed effects you estimate. &lt;span class=&#34;citation&#34;&gt;@eq-reml&lt;/span&gt; &lt;strong&gt;is
the objective function&lt;/strong&gt; REML optimises.&lt;/p&gt;
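&lt;p&gt;The link between the error contrasts and the formula can be checked numerically. The sketch below builds one orthonormal choice of &lt;span class=&#34;math inline&#34;&gt;\(\mathbf K\)&lt;/span&gt; from a QR decomposition of &lt;span class=&#34;math inline&#34;&gt;\(\mathbf X\)&lt;/span&gt;, evaluates the normal log-density of &lt;span class=&#34;math inline&#34;&gt;\(\mathbf K^\top\mathbf y\)&lt;/span&gt; directly, and compares it with the restricted log-likelihood formula. For an orthonormal &lt;span class=&#34;math inline&#34;&gt;\(\mathbf K\)&lt;/span&gt; the two differ by the constant &lt;span class=&#34;math inline&#34;&gt;\(\tfrac12\log|\mathbf X^\top\mathbf X|\)&lt;/span&gt;, which involves no variance parameter. The toy design and every &lt;code&gt;_toy&lt;/code&gt; name are assumptions for illustration, separate from the tutorial’s own objects:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(5)                                 # toy balanced design, illustration only
g_toy &amp;lt;- 4; n_toy &amp;lt;- 3; N_toy &amp;lt;- g_toy * n_toy
Z_toy &amp;lt;- kronecker(diag(g_toy), matrix(1, n_toy, 1))   # group-incidence matrix
X_toy &amp;lt;- matrix(1, N_toy, 1)                # intercept only, so p = 1
y_toy &amp;lt;- 10 + Z_toy %*% rnorm(g_toy, sd = 2) + rnorm(N_toy, sd = 3)
V_toy &amp;lt;- 4 * tcrossprod(Z_toy) + 9 * diag(N_toy)       # V at one trial (sa2, se2)
p_toy &amp;lt;- ncol(X_toy)

# Orthonormal error contrasts: columns of the full Q that are orthogonal to X
Q &amp;lt;- qr.Q(qr(X_toy), complete = TRUE)
K &amp;lt;- Q[, (p_toy + 1):N_toy, drop = FALSE]
max(abs(crossprod(K, X_toy)))               # numerically zero: t(K) %*% X = 0

# Log-density of the contrasts under N(0, t(K) V K)
w    &amp;lt;- crossprod(K, y_toy)
KVK  &amp;lt;- crossprod(K, V_toy %*% K)
ll_K &amp;lt;- -0.5 * ((N_toy - p_toy) * log(2 * pi) +
                as.numeric(determinant(KVK)$modulus) +
                as.numeric(crossprod(w, solve(KVK, w))))

# Restricted log-likelihood evaluated from the formula in the text
Vinv  &amp;lt;- solve(V_toy)
XtViX &amp;lt;- t(X_toy) %*% Vinv %*% X_toy
r     &amp;lt;- y_toy - X_toy %*% solve(XtViX, t(X_toy) %*% Vinv %*% y_toy)
ll_R  &amp;lt;- -0.5 * ((N_toy - p_toy) * log(2 * pi) +
                 as.numeric(determinant(V_toy)$modulus) +
                 as.numeric(determinant(XtViX)$modulus) +
                 as.numeric(crossprod(r, Vinv %*% r)))

# The difference equals 0.5 * log|t(X) X|, free of the variances
all.equal(ll_K - ll_R, 0.5 * as.numeric(determinant(crossprod(X_toy))$modulus))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Changing the trial variances inside &lt;code&gt;V_toy&lt;/code&gt; changes both log-likelihoods but leaves their difference untouched, which is why the choice of &lt;span class=&#34;math inline&#34;&gt;\(\mathbf K\)&lt;/span&gt; cannot move the REML estimates.&lt;/p&gt;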
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;neg_restricted_loglik &amp;lt;- function(theta) {
  sa2 &amp;lt;- exp(theta[1]); se2 &amp;lt;- exp(theta[2])
  V   &amp;lt;- make_V(sa2, se2)
  R   &amp;lt;- chol(V)
  logdetV &amp;lt;- 2 * sum(log(diag(R)))
  Vinv    &amp;lt;- chol2inv(R)
  XtVi    &amp;lt;- crossprod(X, Vinv)
  XtViX   &amp;lt;- XtVi %*% X
  beta    &amp;lt;- solve(XtViX, XtVi %*% y)
  r       &amp;lt;- y - X %*% beta
  quad    &amp;lt;- as.numeric(crossprod(r, Vinv %*% r))
  p       &amp;lt;- ncol(X)
  logdet_XtViX &amp;lt;- as.numeric(determinant(XtViX, logarithm = TRUE)$modulus)
  # = -ell_R  (the lone extra term vs. neg_loglik is logdet_XtViX)
  0.5 * ((N - p) * log(2*pi) + logdetV + logdet_XtViX + quad)
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_reml &amp;lt;- optim(start, neg_restricted_loglik, method = &amp;quot;BFGS&amp;quot;, hessian = TRUE)

reml &amp;lt;- list(
  sigma_a2 = exp(fit_reml$par[1]),
  sigma_e2 = exp(fit_reml$par[2])
)
Vhat_r &amp;lt;- make_V(reml$sigma_a2, reml$sigma_e2)
Vinv_r &amp;lt;- solve(Vhat_r)
reml$mu  &amp;lt;- as.numeric(solve(crossprod(X, Vinv_r) %*% X, crossprod(X, Vinv_r) %*% y))
reml$icc &amp;lt;- reml$sigma_a2 / (reml$sigma_a2 + reml$sigma_e2)
unlist(reml[c(&amp;quot;mu&amp;quot;,&amp;quot;sigma_a2&amp;quot;,&amp;quot;sigma_e2&amp;quot;,&amp;quot;icc&amp;quot;)])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         mu   sigma_a2   sigma_e2        icc 
## 10.1173062  3.4565333  9.1507465  0.2741696&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the balanced one-way model, REML has the closed form&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\hat\sigma^2_{e,\text{REML}} = MS_E,\qquad
\hat\sigma^2_{a,\text{REML}} = \frac{MS_A - MS_E}{n}
= \frac{1}{n}\!\left(\frac{SS_A}{g-1} - MS_E\right),
\]&lt;/span&gt;&lt;/p&gt;
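&lt;p&gt;i.e. &lt;strong&gt;REML reproduces the ANOVA/MoM estimator exactly&lt;/strong&gt;: it divides &lt;span class=&#34;math inline&#34;&gt;\(SS_A\)&lt;/span&gt; by
&lt;span class=&#34;math inline&#34;&gt;\(g-1\)&lt;/span&gt;, restoring the degree of freedom that ML dropped. Any small gap between
the &lt;code&gt;optim()&lt;/code&gt; row and the closed forms in the output below reflects the optimiser’s
convergence tolerance, not a difference between the estimators.&lt;/p&gt;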
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rbind(
  optim_REML  = c(sigma_a2 = reml$sigma_a2, sigma_e2 = reml$sigma_e2),
  closed_REML = c(sigma_a2 = (MS_A - MS_E)/n, sigma_e2 = MS_E),
  closed_MoM  = c(sigma_a2 = mom$sigma_a2,    sigma_e2 = mom$sigma_e2),
  closed_MLE  = c(sigma_a2 = ((SS_A/g) - MS_E)/n, sigma_e2 = MS_E)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##             sigma_a2 sigma_e2
## optim_REML  3.456533 9.150746
## closed_REML 3.456564 9.149932
## closed_MoM  3.456564 9.149932
## closed_MLE  2.996534 9.149932&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Optional black-box sanity check: lme4 defaults to REML and should match ours.
fm &amp;lt;- lme4::lmer(y ~ 1 + (1 | group), data = dat, REML = TRUE)
vc &amp;lt;- as.data.frame(lme4::VarCorr(fm))
cat(&amp;quot;lme4 REML  sigma_a2 =&amp;quot;, round(vc$vcov[1], 4),
    &amp;quot; sigma_e2 =&amp;quot;, round(vc$vcov[2], 4), &amp;quot;\n&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## lme4 REML  sigma_a2 = 3.4566  sigma_e2 = 9.1499&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cat(&amp;quot;ours REML  sigma_a2 =&amp;quot;, round(reml$sigma_a2, 4),
    &amp;quot; sigma_e2 =&amp;quot;, round(reml$sigma_e2, 4), &amp;quot;\n&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ours REML  sigma_a2 = 3.4565  sigma_e2 = 9.1507&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;visualising-the-bias-correction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualising the bias correction&lt;/h2&gt;
&lt;p&gt;The most instructive REML picture overlays the two &lt;strong&gt;profile log-likelihoods&lt;/strong&gt;
for &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;, ordinary and restricted, each maximised over &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;.
The restricted curve’s peak sits to the &lt;em&gt;right&lt;/em&gt; of the ML peak: REML pushes the
between-group variance estimate upward, undoing the ML bias.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;profile_over_se2 &amp;lt;- function(sa2, fn) {
  o &amp;lt;- optimize(function(lse2) fn(c(log(sa2), lse2)),
                lower = log(1e-3), upper = log(1e3))
  -o$objective
}
sa2_seq &amp;lt;- seq(0.05, 12, length.out = 120)
ll_ml   &amp;lt;- sapply(sa2_seq, profile_over_se2, fn = neg_loglik)
ll_reml &amp;lt;- sapply(sa2_seq, profile_over_se2, fn = neg_restricted_loglik)

prof &amp;lt;- rbind(
  data.frame(sigma_a2 = sa2_seq, rel = ll_ml   - max(ll_ml),   method = &amp;quot;MLE&amp;quot;),
  data.frame(sigma_a2 = sa2_seq, rel = ll_reml - max(ll_reml), method = &amp;quot;REML&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-reml-shift&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-reml-shift-1.png&#34; alt=&#34;Profile (ordinary) vs. restricted profile log-likelihood for $\sigma_a^2$, each normalised to peak at zero. REML&#39;s peak is shifted toward the truth, illustrating the bias correction from the extra $-\tfrac12\log|X^\top V^{-1}X|$ term.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 5: Profile (ordinary) vs. restricted profile log-likelihood for &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;, each normalised to peak at zero. REML’s peak is shifted toward the truth, illustrating the bias correction from the extra &lt;span class=&#34;math inline&#34;&gt;\(-\tfrac12\log|X^\top V^{-1}X|\)&lt;/span&gt; term.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-with-the-truth-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Comparison with the truth&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;##   parameter      truth        MLE       REML
## 1        mu 10.0000000 10.1173062 10.1173062
## 2  sigma_a2  4.0000000  2.9965344  3.4565333
## 3  sigma_e2  9.0000000  9.1499324  9.1507465
## 4       ICC  0.3076923  0.2467001  0.2741696&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;uncertainty-quantification-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Uncertainty quantification&lt;/h2&gt;
&lt;p&gt;Inference proceeds exactly as for ML, but using the Hessian of &lt;span class=&#34;citation&#34;&gt;@eq-reml&lt;/span&gt;. The
resulting standard errors for variance components are generally &lt;em&gt;better
calibrated&lt;/em&gt; in small samples because the objective already accounts for
fixed-effect estimation. The standard error of &lt;span class=&#34;math inline&#34;&gt;\(\hat\mu\)&lt;/span&gt; still comes from the GLS
formula &lt;span class=&#34;math inline&#34;&gt;\(\operatorname{Var}(\hat\mu)=(\mathbf X^\top\mathbf V^{-1}\mathbf X)^{-1}\)&lt;/span&gt;,
now evaluated at the REML &lt;span class=&#34;math inline&#34;&gt;\(\mathbf V\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cov_theta_r &amp;lt;- solve(fit_reml$hessian)
se_theta_r  &amp;lt;- sqrt(diag(cov_theta_r))
se_sa2_r &amp;lt;- reml$sigma_a2 * se_theta_r[1]
se_se2_r &amp;lt;- reml$sigma_e2 * se_theta_r[2]
se_mu_r  &amp;lt;- sqrt(as.numeric(solve(crossprod(X, Vinv_r) %*% X)))

data.frame(
  parameter = c(&amp;quot;mu&amp;quot;, &amp;quot;sigma_a2&amp;quot;, &amp;quot;sigma_e2&amp;quot;),
  estimate  = c(reml$mu, reml$sigma_a2, reml$sigma_e2),
  SE        = c(se_mu_r, se_sa2_r, se_se2_r),
  lower95   = c(reml$mu - 1.96*se_mu_r, exp(fit_reml$par[1] - 1.96*se_theta_r[1]),
                exp(fit_reml$par[2] - 1.96*se_theta_r[2])),
  upper95   = c(reml$mu + 1.96*se_mu_r, exp(fit_reml$par[1] + 1.96*se_theta_r[1]),
                exp(fit_reml$par[2] + 1.96*se_theta_r[2]))
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   parameter  estimate        SE  lower95  upper95
## 1        mu 10.117306 0.6782608 8.787915 11.44670
## 2  sigma_a2  3.456533 2.1772520 1.005689 11.88004
## 3  sigma_e2  9.150746 1.5468265 6.570043 12.74515&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;strengths-weaknesses-assumptions-2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strengths, weaknesses, assumptions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strengths.&lt;/strong&gt; (Approximately, and for balanced data exactly) &lt;strong&gt;unbiased&lt;/strong&gt;
variance components; the de-facto standard for linear mixed models; better
small-sample calibration than ML.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaknesses.&lt;/strong&gt; The restricted likelihood depends on the fixed-effects design,
so &lt;strong&gt;you cannot use REML likelihoods to compare models with different fixed
effects&lt;/strong&gt; (likelihood-ratio tests and AIC across different &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{X}\)&lt;/span&gt; are
invalid under REML; refit with ML for those). It still assumes normality, and
still gives only point estimates with asymptotic standard errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assumptions used here.&lt;/strong&gt; Same normal model &lt;span class=&#34;citation&#34;&gt;@eq-marginal&lt;/span&gt;; the contrasts
&lt;span class=&#34;math inline&#34;&gt;\(\mathbf K^\top\mathbf y\)&lt;/span&gt; are exactly normal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typical applications.&lt;/strong&gt; Variance-component and heritability estimation,
animal/plant breeding, multilevel and longitudinal models, anywhere unbiased
variance estimates matter and the fixed-effects structure is fixed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sec-bayes&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Bayesian Estimation&lt;/h1&gt;
&lt;div id=&#34;philosophy-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Philosophy&lt;/h2&gt;
&lt;p&gt;The methods so far treated the parameters &lt;span class=&#34;math inline&#34;&gt;\(\mu,\sigma_a^2,\sigma_e^2\)&lt;/span&gt; as fixed
but unknown numbers, and asked which values the data point to. The Bayesian view
starts somewhere different: it treats every unknown (the parameters, and even the
unobserved group effects &lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt;) as a &lt;strong&gt;random variable&lt;/strong&gt; described by a
probability distribution. This does not claim the parameter is physically random;
the distribution simply encodes how sure we are about its value.&lt;/p&gt;
&lt;p&gt;Estimation then becomes &lt;em&gt;learning from data&lt;/em&gt;. We begin with a &lt;strong&gt;prior&lt;/strong&gt;
distribution &lt;span class=&#34;math inline&#34;&gt;\(p(\boldsymbol\theta)\)&lt;/span&gt;, our beliefs before seeing the data. The data
enter through the &lt;strong&gt;likelihood&lt;/strong&gt; &lt;span class=&#34;math inline&#34;&gt;\(p(\mathbf y\mid\boldsymbol\theta)\)&lt;/span&gt;, the very same
density we maximised for MLE. Bayes’ theorem combines the two into a
&lt;strong&gt;posterior&lt;/strong&gt; distribution &lt;span class=&#34;math inline&#34;&gt;\(p(\boldsymbol\theta\mid\mathbf y)\)&lt;/span&gt;, our updated beliefs
after seeing the data:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
p(\boldsymbol\theta\mid\mathbf y)
= \frac{p(\mathbf y\mid\boldsymbol\theta)\,p(\boldsymbol\theta)}{p(\mathbf y)}
\;\propto\;
\underbrace{p(\mathbf y\mid\boldsymbol\theta)}_{\text{likelihood}}\;
\underbrace{p(\boldsymbol\theta)}_{\text{prior}} .
\]&lt;/span&gt; {#eq-bayes}&lt;/p&gt;
&lt;p&gt;In words: the posterior is proportional to the likelihood times the prior. The
answer is not a single number but a whole distribution. We summarise it by its
centre (the posterior mean or median, which serves as a point estimate) and its
spread (a 95% &lt;strong&gt;credible interval&lt;/strong&gt;, the range that holds the parameter with 95%
posterior probability). That phrase is worth pausing on: a credible interval
means exactly what people usually, and incorrectly, believe a confidence interval
means.&lt;/p&gt;
&lt;p&gt;The one awkward piece is the denominator &lt;span class=&#34;math inline&#34;&gt;\(p(\mathbf y)\)&lt;/span&gt;, a normalising constant
(the likelihood times the prior, integrated over every possible &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol\theta\)&lt;/span&gt;) that is
rarely available in closed form. The idea that makes modern
Bayesian analysis practical is to sidestep it entirely by &lt;em&gt;drawing samples&lt;/em&gt; from
the posterior rather than writing it down in closed form.&lt;/p&gt;
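&lt;p&gt;A minimal sketch of “posterior is proportional to likelihood times prior”, on a hypothetical toy problem rather than the tutorial’s dataset: a single Normal mean with the residual variance treated as known. The names &lt;code&gt;y_toy&lt;/code&gt; and &lt;code&gt;mu_grid&lt;/code&gt; and all numeric values are assumptions made for illustration. The product is evaluated on a grid and normalised numerically, which stands in for &lt;span class=&#34;math inline&#34;&gt;\(p(\mathbf y)\)&lt;/span&gt;, and the result is compared with the standard Normal-Normal closed form.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data (hypothetical, not the dataset used in this tutorial)
set.seed(42)
y_toy  &amp;lt;- rnorm(20, mean = 10, sd = 3)
sigma2 &amp;lt;- 9               # residual variance, treated as known here
mu0    &amp;lt;- 8; g0_2 &amp;lt;- 4    # prior N(mu0, g0_2); g0_2 is a variance

# Unnormalised log posterior on a grid: log likelihood + log prior
mu_grid  &amp;lt;- seq(0, 20, length.out = 2001)
step     &amp;lt;- mu_grid[2] - mu_grid[1]
log_lik  &amp;lt;- sapply(mu_grid, function(m)
  sum(dnorm(y_toy, mean = m, sd = sqrt(sigma2), log = TRUE)))
log_post &amp;lt;- log_lik + dnorm(mu_grid, mean = mu0, sd = sqrt(g0_2), log = TRUE)

# Normalise numerically; this sum plays the role of p(y)
post_grid &amp;lt;- exp(log_post - max(log_post))
post_grid &amp;lt;- post_grid / sum(post_grid * step)

# Closed-form Normal-Normal posterior mean, for comparison
prec_post  &amp;lt;- 1 / g0_2 + length(y_toy) / sigma2
mean_exact &amp;lt;- (mu0 / g0_2 + sum(y_toy) / sigma2) / prec_post
mean_grid  &amp;lt;- sum(mu_grid * post_grid * step)
c(grid = mean_grid, exact = mean_exact)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two numbers should agree to several decimals. A grid works with one unknown but becomes impractical with many, which is the gap that sampling fills.&lt;/p&gt;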
&lt;/div&gt;
&lt;div id=&#34;model-priors-and-conjugacy&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model, priors, and conjugacy&lt;/h2&gt;
&lt;p&gt;We write the model in its hierarchical form, keeping the group effects &lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt; in as
unknowns to be estimated alongside everything else:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
y_{ij}\mid \mu, a_i, \sigma_e^2 &amp;amp;\sim \mathcal N(\mu + a_i,\ \sigma_e^2), \\
a_i\mid\sigma_a^2 &amp;amp;\sim \mathcal N(0,\ \sigma_a^2),
\end{aligned}
\qquad\text{with priors}\qquad
\begin{aligned}
\mu &amp;amp;\sim \mathcal N(\mu_0,\ \gamma_0^2), \\
\sigma_e^2 &amp;amp;\sim \text{Inv-Gamma}(a_e, b_e), \\
\sigma_a^2 &amp;amp;\sim \text{Inv-Gamma}(a_a, b_a).
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;These particular prior families are not arbitrary; they are chosen to make the
updating clean. A prior is &lt;strong&gt;conjugate&lt;/strong&gt; to the likelihood when the resulting
posterior belongs to the &lt;em&gt;same family&lt;/em&gt; as the prior. Conjugacy is a convenience
rather than a requirement, but a powerful one: it means each update has an exact,
closed-form answer, so we can sample from it directly with no approximation. Two
classic conjugate pairs appear here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Normal&lt;/strong&gt; prior on a mean, combined with Normal data, yields a Normal
posterior. That is why the grand mean &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; receives a Normal prior.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Inverse-Gamma&lt;/strong&gt; prior on a variance, combined with Normal data, yields an
Inverse-Gamma posterior. The Inverse-Gamma also lives only on the positive
numbers, which is exactly what a variance requires. That is why &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt; and
&lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt; receive Inverse-Gamma priors.&lt;/li&gt;
&lt;/ul&gt;
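&lt;p&gt;To see why the second pair is conjugate, here is a short derivation sketch under one simplifying assumption: &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; Normal observations &lt;span class=&#34;math inline&#34;&gt;\(y_k\)&lt;/span&gt; with a known mean &lt;span class=&#34;math inline&#34;&gt;\(m\)&lt;/span&gt;, with &lt;span class=&#34;math inline&#34;&gt;\(S=\sum_k (y_k-m)^2\)&lt;/span&gt;, and an &lt;span class=&#34;math inline&#34;&gt;\(\text{Inv-Gamma}(a,b)\)&lt;/span&gt; prior in the shape-scale parametrisation, whose density is proportional to &lt;span class=&#34;math inline&#34;&gt;\((\sigma^2)^{-(a+1)}e^{-b/\sigma^2}\)&lt;/span&gt;. The symbols &lt;span class=&#34;math inline&#34;&gt;\(n, m, S, a, b\)&lt;/span&gt; are generic placeholders here, not quantities from the tutorial’s data.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
p(\sigma^2\mid\mathbf y)
&amp;amp;\propto
(\sigma^2)^{-n/2}\exp\!\left(-\frac{S}{2\sigma^2}\right)
\times
(\sigma^2)^{-(a+1)}\exp\!\left(-\frac{b}{\sigma^2}\right) \\
&amp;amp;=
(\sigma^2)^{-(a+n/2)-1}\exp\!\left(-\frac{b+S/2}{\sigma^2}\right),
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;which is the kernel of an &lt;span class=&#34;math inline&#34;&gt;\(\text{Inv-Gamma}(a+n/2,\ b+S/2)\)&lt;/span&gt; density: the data add &lt;span class=&#34;math inline&#34;&gt;\(n/2\)&lt;/span&gt; to the shape and &lt;span class=&#34;math inline&#34;&gt;\(S/2\)&lt;/span&gt; to the scale.&lt;/p&gt;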
&lt;p&gt;We use &lt;strong&gt;vague&lt;/strong&gt; defaults intended to let the data dominate: a near-flat
&lt;span class=&#34;math inline&#34;&gt;\(\mathcal N(0,10^6)\)&lt;/span&gt; on &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; (that is, &lt;span class=&#34;math inline&#34;&gt;\(\mu_0=0\)&lt;/span&gt; and prior variance &lt;span class=&#34;math inline&#34;&gt;\(\gamma_0^2=10^6\)&lt;/span&gt;) and &lt;span class=&#34;math inline&#34;&gt;\(\text{Inv-Gamma}(0.001,0.001)\)&lt;/span&gt; on each
variance. (More on this choice, and its known fragility for variance components,
below.)&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;deriving-the-gibbs-sampler&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Deriving the Gibbs sampler&lt;/h2&gt;
&lt;p&gt;We still cannot write the full posterior as one tidy distribution. But thanks to
conjugacy we &lt;em&gt;can&lt;/em&gt; write down, for each unknown, its &lt;strong&gt;full conditional&lt;/strong&gt;: the
distribution of that one parameter given the data and the current values of all
the others. Each full conditional turns out to be a familiar, easy-to-sample
family.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Gibbs sampler&lt;/strong&gt; exploits this. Starting from any reasonable values, it
visits the unknowns one at a time, replacing each with a fresh draw from its full
conditional while holding the others fixed, and then repeats this cycle thousands
of times. Remarkably, the values it visits settle into genuine draws from the
joint posterior, and the intractable constant &lt;span class=&#34;math inline&#34;&gt;\(p(\mathbf y)\)&lt;/span&gt; never has to be
computed. The four full conditionals, with &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{i\cdot}\)&lt;/span&gt; the &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;-th group
mean, are as follows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Group effects&lt;/strong&gt; (a precision-weighted shrinkage toward zero):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
a_i \mid \cdots \;\sim\;
\mathcal N\!\left(
\frac{(n/\sigma_e^2)(\bar y_{i\cdot}-\mu)}{n/\sigma_e^2 + 1/\sigma_a^2},
\;\;
\frac{1}{n/\sigma_e^2 + 1/\sigma_a^2}
\right).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In plain terms:&lt;/em&gt; each group’s effect is estimated as a compromise between what
that group’s own data say (&lt;span class=&#34;math inline&#34;&gt;\(\bar y_{i\cdot}-\mu\)&lt;/span&gt;) and zero. The weight on the data,
&lt;span class=&#34;math inline&#34;&gt;\((n/\sigma_e^2)/(n/\sigma_e^2+1/\sigma_a^2)\)&lt;/span&gt;, grows with the group’s sample size and with the
between-group variance. Groups with little data are pulled harder toward zero, and
every group is pulled harder when the groups look alike overall (small &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;). This automatic
&lt;em&gt;shrinkage&lt;/em&gt; (the same idea as a BLUP in breeding work) is one of the most useful
features of hierarchical models.&lt;/p&gt;
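&lt;p&gt;The weight on a group’s own data can be computed directly from the conditional mean above. The sketch below is illustrative only: &lt;code&gt;shrink_weight()&lt;/code&gt; is a hypothetical helper, the group sizes are made up, and the variances are set to the simulation’s true values (&lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2=9\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2=4\)&lt;/span&gt;) rather than to sampled values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Share of (ybar_i - mu) that survives shrinkage, for a group of size n
shrink_weight &amp;lt;- function(n, sigma_e2, sigma_a2) {
  (n / sigma_e2) / (n / sigma_e2 + 1 / sigma_a2)
}

n_obs &amp;lt;- c(2, 5, 20)   # hypothetical group sizes
w &amp;lt;- shrink_weight(n_obs, sigma_e2 = 9, sigma_a2 = 4)
round(w, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.471 0.690 0.899&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A group of two observations whose mean sits 2 units above &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; would therefore have a conditional mean effect of about 0.94, while a group of twenty would keep about 1.80 of those 2 units.&lt;/p&gt;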
&lt;p&gt;&lt;strong&gt;Grand mean&lt;/strong&gt; (a Normal prior updated by Normal data gives a Normal posterior):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mu \mid \cdots \;\sim\;
\mathcal N\!\left(
\frac{\mu_0/\gamma_0^2 + \big(\sum_{i,j}(y_{ij}-a_i)\big)/\sigma_e^2}
     {1/\gamma_0^2 + N/\sigma_e^2},
\;\;
\frac{1}{1/\gamma_0^2 + N/\sigma_e^2}
\right).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In plain terms:&lt;/em&gt; the updated mean is a weighted average of the prior guess
&lt;span class=&#34;math inline&#34;&gt;\(\mu_0\)&lt;/span&gt; and the data average (after the group effects are removed), each weighted
by its &lt;strong&gt;precision&lt;/strong&gt;, which is one over its variance, that is, how informative it
is. A near-flat prior carries almost no precision, so the data effectively decide
&lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;.&lt;/p&gt;
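&lt;p&gt;The weighted-average reading can be made explicit by rearranging the conditional mean above. Here &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\bar y_{\text{adj}}\)&lt;/span&gt; are shorthand introduced only for this illustration:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\mathbb E[\mu \mid \cdots]
= w\,\mu_0 + (1-w)\,\bar y_{\text{adj}},
\qquad
w = \frac{1/\gamma_0^2}{1/\gamma_0^2 + N/\sigma_e^2},
\qquad
\bar y_{\text{adj}} = \frac{1}{N}\sum_{i,j}(y_{ij}-a_i).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;With &lt;span class=&#34;math inline&#34;&gt;\(\gamma_0^2 = 10^6\)&lt;/span&gt;, the prior weight &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; stays below &lt;span class=&#34;math inline&#34;&gt;\(10^{-6}\)&lt;/span&gt; whenever &lt;span class=&#34;math inline&#34;&gt;\(N/\sigma_e^2 \ge 1\)&lt;/span&gt;, so the prior guess &lt;span class=&#34;math inline&#34;&gt;\(\mu_0\)&lt;/span&gt; barely moves the result.&lt;/p&gt;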
&lt;p&gt;&lt;strong&gt;Residual variance&lt;/strong&gt; (Inverse-Gamma prior, Normal data, Inverse-Gamma posterior):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\sigma_e^2 \mid \cdots \;\sim\;
\text{Inv-Gamma}\!\left(a_e + \tfrac{N}{2},\;\;
b_e + \tfrac12\sum_{i,j}(y_{ij}-\mu-a_i)^2\right).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In plain terms:&lt;/em&gt; the posterior keeps the Inverse-Gamma shape. Its first (shape)
parameter grows by &lt;span class=&#34;math inline&#34;&gt;\(N/2\)&lt;/span&gt; and its second (scale) parameter grows by half the leftover within-group variation
&lt;span class=&#34;math inline&#34;&gt;\(\sum(y_{ij}-\mu-a_i)^2\)&lt;/span&gt;. More residual scatter pushes &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt; higher,
exactly as intuition demands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Between-group variance&lt;/strong&gt; (the same Inverse-Gamma conjugacy):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\sigma_a^2 \mid \cdots \;\sim\;
\text{Inv-Gamma}\!\left(a_a + \tfrac{g}{2},\;\;
b_a + \tfrac12\sum_{i} a_i^2\right).
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;In plain terms:&lt;/em&gt; identical in spirit, now driven by how spread out the estimated
group effects &lt;span class=&#34;math inline&#34;&gt;\(a_i\)&lt;/span&gt; are. Widely differing groups push &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt; up;
near-identical groups pull it toward zero.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A draw from Inv-Gamma(shape a, scale b) is taken as 1 / rgamma(1, shape = a, rate = b).
gibbs_one_way &amp;lt;- function(y, group, n_iter = 12000, burn = 2000,
                          mu0 = 0, g0_2 = 1e6,
                          a_e = 1e-3, b_e = 1e-3,
                          a_a = 1e-3, b_a = 1e-3, seed = 1) {
  set.seed(seed)
  group  &amp;lt;- factor(group); lev &amp;lt;- levels(group)
  g &amp;lt;- length(lev); N &amp;lt;- length(y); gidx &amp;lt;- as.integer(group)
  n_i    &amp;lt;- as.numeric(table(group))
  ybar_i &amp;lt;- as.numeric(tapply(y, group, mean))

  draws &amp;lt;- matrix(NA_real_, n_iter, 3,
                  dimnames = list(NULL, c(&amp;quot;mu&amp;quot;, &amp;quot;sigma_a2&amp;quot;, &amp;quot;sigma_e2&amp;quot;)))

  mu &amp;lt;- mean(y); se2 &amp;lt;- var(y) / 2; sa2 &amp;lt;- var(y) / 2; a &amp;lt;- numeric(g)  # init

  for (t in seq_len(n_iter)) {
    ## 1) group effects a_i
    prec_a &amp;lt;- n_i / se2 + 1 / sa2
    a &amp;lt;- rnorm(g, mean = (n_i / se2) * (ybar_i - mu) / prec_a,
                  sd   = sqrt(1 / prec_a))
    ## 2) grand mean mu
    prec_mu &amp;lt;- 1 / g0_2 + N / se2
    s_resid &amp;lt;- sum(n_i * (ybar_i - a))            # = sum_ij (y_ij - a_i)
    mu &amp;lt;- rnorm(1, mean = (mu0 / g0_2 + s_resid / se2) / prec_mu,
                   sd   = sqrt(1 / prec_mu))
    ## 3) residual variance sigma_e^2
    rss &amp;lt;- sum((y - mu - a[gidx])^2)
    se2 &amp;lt;- 1 / rgamma(1, shape = a_e + N / 2, rate = b_e + rss / 2)
    ## 4) between variance sigma_a^2
    sa2 &amp;lt;- 1 / rgamma(1, shape = a_a + g / 2, rate = b_a + sum(a^2) / 2)

    draws[t, ] &amp;lt;- c(mu, sa2, se2)
  }
  draws[(burn + 1):n_iter, , drop = FALSE]
}

post &amp;lt;- gibbs_one_way(y, group)
post_icc &amp;lt;- post[, &amp;quot;sigma_a2&amp;quot;] / (post[, &amp;quot;sigma_a2&amp;quot;] + post[, &amp;quot;sigma_e2&amp;quot;])
nrow(post)   # retained posterior draws&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10000&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;did-the-sampler-work-trace-plots&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Did the sampler work? Trace plots&lt;/h2&gt;
&lt;p&gt;Before trusting any summary we check that the chain has mixed well and reached a
stationary regime (no trend, good exploration).&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-trace&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-trace-1.png&#34; alt=&#34;Trace plots of the retained draws. Stationary, well-mixed &#39;fuzzy caterpillars&#39; indicate the sampler is exploring the posterior rather than drifting.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 6: Trace plots of the retained draws. Stationary, well-mixed ‘fuzzy caterpillars’ indicate the sampler is exploring the posterior rather than drifting.
&lt;/p&gt;
&lt;/div&gt;
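&lt;p&gt;A trace plot is a visual check; a common numerical companion is the Gelman-Rubin potential scale reduction factor, often written R-hat, which compares several chains. The sketch below implements the classic (non-split) version in base R and demonstrates it on simulated stand-in chains, not on the tutorial’s sampler output. The function name &lt;code&gt;rhat()&lt;/code&gt; and the matrices &lt;code&gt;good&lt;/code&gt; and &lt;code&gt;bad&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Classic Gelman-Rubin potential scale reduction factor.
# chains: numeric matrix, one column per chain, one row per retained draw.
rhat &amp;lt;- function(chains) {
  n &amp;lt;- nrow(chains)
  B &amp;lt;- n * var(colMeans(chains))       # between-chain variance
  W &amp;lt;- mean(apply(chains, 2, var))     # mean within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
good &amp;lt;- matrix(rnorm(4000), ncol = 4)  # four chains sampling the same target
bad  &amp;lt;- good
bad[, 1] &amp;lt;- bad[, 1] + 3               # one chain stuck somewhere else
c(good = rhat(good), bad = rhat(bad))

# With the sampler above, chains could be stacked as, for example:
# sapply(1:4, function(s) gibbs_one_way(y, group, seed = s)[, &amp;quot;sigma_a2&amp;quot;])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Values close to 1 suggest the chains agree; values well above 1 suggest they have not converged to the same distribution. Note that &lt;code&gt;gibbs_one_way()&lt;/code&gt; starts every chain from the same initial values, so changing only the seed gives a weaker check than chains started from dispersed points.&lt;/p&gt;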
&lt;/div&gt;
&lt;div id=&#34;posterior-distributions-and-credible-intervals&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Posterior distributions and credible intervals&lt;/h2&gt;
&lt;p&gt;The posterior &lt;em&gt;is&lt;/em&gt; the inference. We summarise each margin by its mean and a
&lt;strong&gt;95% equal-tailed credible interval&lt;/strong&gt;, the central interval containing 95% of
the posterior probability.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-posterior&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-posterior-1.png&#34; alt=&#34;Marginal posteriors. Solid line = posterior mean; shaded band = 95% credible interval; dashed line = the true value. The posteriors comfortably cover the truth.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 7: Marginal posteriors. Solid line = posterior mean; shaded band = 95% credible interval; dashed line = the true value. The posteriors comfortably cover the truth.
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bayes_summary &amp;lt;- function(x)
  c(mean = mean(x), median = median(x),
    lower95 = unname(quantile(x, 0.025)), upper95 = unname(quantile(x, 0.975)))

bayes &amp;lt;- rbind(
  mu       = bayes_summary(post[, &amp;quot;mu&amp;quot;]),
  sigma_a2 = bayes_summary(post[, &amp;quot;sigma_a2&amp;quot;]),
  sigma_e2 = bayes_summary(post[, &amp;quot;sigma_e2&amp;quot;]),
  ICC      = bayes_summary(post_icc)
)
round(bayes, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            mean median lower95 upper95
## mu       10.111 10.125   8.650  11.525
## sigma_a2  4.137  3.282   0.607  13.201
## sigma_e2  9.575  9.368   6.802  13.531
## ICC       0.277  0.261   0.051   0.594&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the ICC row: with the full posterior in hand, uncertainty for &lt;strong&gt;any
function&lt;/strong&gt; of the parameters comes for free: we just transform each draw, with
no delta-method approximation. This is a genuine practical advantage of the
Bayesian machinery.&lt;/p&gt;
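&lt;p&gt;The same draw-by-draw transformation works for any other derived quantity. The sketch below uses simulated stand-in draws so that it runs on its own; the vectors &lt;code&gt;draws_a&lt;/code&gt; and &lt;code&gt;draws_e&lt;/code&gt; and their Inverse-Gamma parameters are hypothetical, and in practice they would be the paired columns of &lt;code&gt;post&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(11)
# Stand-in posterior draws (hypothetical); in the tutorial these would be
# post[, &amp;quot;sigma_a2&amp;quot;] and post[, &amp;quot;sigma_e2&amp;quot;], taken row by row.
draws_a &amp;lt;- 1 / rgamma(5000, shape = 6,  rate = 20)
draws_e &amp;lt;- 1 / rgamma(5000, shape = 40, rate = 380)

icc      &amp;lt;- draws_a / (draws_a + draws_e)   # intraclass correlation, per draw
sd_ratio &amp;lt;- sqrt(draws_a / draws_e)         # ratio of standard deviations, per draw

mean(icc &amp;gt; 0.2)                          # posterior probability that ICC exceeds 0.2
quantile(sd_ratio, c(0.025, 0.5, 0.975))    # median and 95% interval for the ratio&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The one rule to respect is that each transformed value must combine parameters from the &lt;em&gt;same&lt;/em&gt; iteration, so that any posterior correlation between them carries through.&lt;/p&gt;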
&lt;/div&gt;
&lt;div id=&#34;how-a-prior-updates-into-a-posterior&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How a prior updates into a posterior&lt;/h2&gt;
&lt;p&gt;The defining Bayesian act is &lt;em&gt;updating&lt;/em&gt;. To see it without distraction, give
&lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; a deliberately &lt;strong&gt;informative&lt;/strong&gt; (and deliberately wrong) prior centred at 8,
then watch the data drag the posterior toward the truth near 10.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post_inform &amp;lt;- gibbs_one_way(y, group, mu0 = 8, g0_2 = 1, seed = 7)  # tight wrong prior
post_vague  &amp;lt;- post                                                  # the main, vague-prior run

xg &amp;lt;- seq(6, 12, length.out = 400)
prior_dens &amp;lt;- dnorm(xg, mean = 8, sd = 1)                            # the informative prior

upd &amp;lt;- rbind(
  data.frame(mu = xg, dens = prior_dens, kind = &amp;quot;Prior  N(8, 1)&amp;quot;),
  data.frame(mu = density(post_inform[, &amp;quot;mu&amp;quot;], from = 6, to = 12, n = 400)$x,
             dens = density(post_inform[, &amp;quot;mu&amp;quot;], from = 6, to = 12, n = 400)$y,
             kind = &amp;quot;Posterior (informative prior)&amp;quot;),
  data.frame(mu = density(post_vague[, &amp;quot;mu&amp;quot;], from = 6, to = 12, n = 400)$x,
             dens = density(post_vague[, &amp;quot;mu&amp;quot;], from = 6, to = 12, n = 400)$y,
             kind = &amp;quot;Posterior (vague prior) ~ likelihood&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-update&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-update-1.png&#34; alt=&#34;Belief updating for $\mu$. A confident-but-wrong prior centred at 8 (grey) is pulled by the data toward $\approx 10$ (pink). With a vague prior, the posterior essentially equals the likelihood (orange dashed). The data dominate the wrong prior.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 8: Belief updating for &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;. A confident-but-wrong prior centred at 8 (grey) is pulled by the data toward &lt;span class=&#34;math inline&#34;&gt;\(\approx 10\)&lt;/span&gt; (pink). With a vague prior, the posterior essentially equals the likelihood (orange dashed). The data dominate the wrong prior.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;a-word-on-variance-priors-and-why-it-matters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A word on variance priors (and why it matters)&lt;/h2&gt;
&lt;p&gt;The &lt;span class=&#34;math inline&#34;&gt;\(\text{Inv-Gamma}(\epsilon,\epsilon)\)&lt;/span&gt; prior is conjugate and convenient, but
Gelman (2006) showed it can be &lt;strong&gt;surprisingly informative&lt;/strong&gt; for hierarchical
variances: when the number of groups is small or the true variance is near zero,
the posterior can be sensitive to &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;. A more robust modern default is a
half-Normal or half-Cauchy prior on the &lt;em&gt;standard deviation&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a\)&lt;/span&gt;. We can
probe sensitivity here by re-running with a different inverse-gamma prior, &lt;span class=&#34;math inline&#34;&gt;\(\text{Inv-Gamma}(2,4)\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post_alt &amp;lt;- gibbs_one_way(y, group, a_a = 2, b_a = 4, seed = 3)  # IG(2, 4): mean ~ 4
rbind(
  `IG(0.001,0.001)` = c(sigma_a2_mean = mean(post[, &amp;quot;sigma_a2&amp;quot;]),
                        sigma_a2_med  = median(post[, &amp;quot;sigma_a2&amp;quot;])),
  `IG(2,4)`         = c(sigma_a2_mean = mean(post_alt[, &amp;quot;sigma_a2&amp;quot;]),
                        sigma_a2_med  = median(post_alt[, &amp;quot;sigma_a2&amp;quot;]))
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                 sigma_a2_mean sigma_a2_med
## IG(0.001,0.001)      4.137276     3.282043
## IG(2,4)              3.191633     2.748499&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With 10 groups the two priors broadly agree (posterior means of about 4.1 and 3.2,
medians of about 3.3 and 2.7), but the gap would widen with fewer
groups, which is why you should always report prior sensitivity for variance
components.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;strengths-weaknesses-assumptions-3&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strengths, weaknesses, assumptions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strengths.&lt;/strong&gt; A full distribution rather than a point; exact (simulation-based)
uncertainty for &lt;em&gt;any&lt;/em&gt; function of the parameters; principled incorporation of
external knowledge; automatic, interpretable shrinkage of group effects; valid
in small samples without asymptotics; coherent propagation of uncertainty
through to predictions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Weaknesses.&lt;/strong&gt; Requires priors (a feature &lt;em&gt;and&lt;/em&gt; a responsibility); results can
be prior-sensitive for variances; computationally heavier; needs convergence
diagnostics; the interpretation differs fundamentally from frequentist output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assumptions used here.&lt;/strong&gt; The Normal likelihood (the marginal-likelihood equation given earlier) and the specified
priors; for the Gibbs sampler, conditional conjugacy; for valid summaries,
convergence of the chain.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Typical applications.&lt;/strong&gt; Hierarchical/multilevel modelling, small-sample or
sparse-data problems, evidence synthesis, decision analysis, and anywhere
full uncertainty propagation or prior information is valuable.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sec-synthesis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Synthesis and comparison&lt;/h1&gt;
&lt;p&gt;We applied four estimation logics to one dataset whose truth we know. Now we put
them side by side.&lt;/p&gt;
&lt;div id=&#34;point-estimates&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Point estimates&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;##   parameter  Truth    MoM    MLE   REML  Bayes
## 1        mu 10.000 10.117 10.117 10.117 10.111
## 2  sigma_a2  4.000  3.457  2.997  3.457  4.137
## 3  sigma_e2  9.000  9.150  9.150  9.151  9.575
## 4       ICC  0.308  0.274  0.247  0.274  0.277&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The pattern is exactly the theory: &lt;strong&gt;MoM = REML&lt;/strong&gt; (both divide &lt;span class=&#34;math inline&#34;&gt;\(SS_A\)&lt;/span&gt; by &lt;span class=&#34;math inline&#34;&gt;\(g-1\)&lt;/span&gt;);
&lt;strong&gt;MLE&lt;/strong&gt; sits below them on &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt; (divides by &lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt;, the finite-sample
downward bias); the &lt;strong&gt;Bayesian&lt;/strong&gt; posterior mean lands near REML but is pulled a
touch higher by the prior and by averaging over a right-skewed posterior. All
four agree closely on &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;, where there is no
degrees-of-freedom subtlety.&lt;/p&gt;
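&lt;p&gt;The divisor difference is easy to reproduce on simulated data. The sketch below assumes a balanced design with 10 groups of 8 observations (the group size is an assumption, as are the names &lt;code&gt;y_sim&lt;/code&gt; and &lt;code&gt;grp_sim&lt;/code&gt;), and uses the closed-form balanced-design estimators, which apply when the resulting variance estimate is positive.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2026)
g &amp;lt;- 10; n &amp;lt;- 8                                   # balanced: g groups of n
grp_sim &amp;lt;- rep(seq_len(g), each = n)
y_sim   &amp;lt;- 10 + rnorm(g, sd = 2)[grp_sim] + rnorm(g * n, sd = 3)

ybar_i &amp;lt;- as.numeric(tapply(y_sim, grp_sim, mean))
ss_a   &amp;lt;- n * sum((ybar_i - mean(y_sim))^2)         # between-group sum of squares
ms_e   &amp;lt;- sum((y_sim - ybar_i[grp_sim])^2) / (g * (n - 1))

sigma_a2_mom &amp;lt;- (ss_a / (g - 1) - ms_e) / n         # MoM and REML: divide by g - 1
sigma_a2_ml  &amp;lt;- (ss_a / g       - ms_e) / n         # ML: divide by g
c(MoM_REML = sigma_a2_mom, ML = sigma_a2_ml)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The gap between the two is exactly &lt;span class=&#34;math inline&#34;&gt;\(SS_A / \{g(g-1)n\}\)&lt;/span&gt;, so it shrinks as the number of groups grows.&lt;/p&gt;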
&lt;/div&gt;
&lt;div id=&#34;variance-estimates-and-uncertainty-side-by-side&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variance estimates and uncertainty, side by side&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:fig-forest&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://agprophet.netlify.app/project/estimation/index_en_files/figure-html/fig-forest-1.png&#34; alt=&#34;Point estimates with uncertainty for each parameter and method. Bars are 95% intervals (Wald/χ² for the frequentist methods, credible for Bayes). Dashed line = truth. The methods agree on $\mu$ and $\sigma_e^2$ and diverge, as theory predicts, on the between-group variance $\sigma_a^2$.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 9: Point estimates with uncertainty for each parameter and method. Bars are 95% intervals (Wald/χ² for the frequentist methods, credible for Bayes). Dashed line = truth. The methods agree on &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt; and diverge, as theory predicts, on the between-group variance &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;a-structured-comparison&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A structured comparison&lt;/h2&gt;
&lt;table&gt;
&lt;caption&gt;&lt;span id=&#34;tab:synth-table&#34;&gt;Table 1: &lt;/span&gt;Conceptual comparison of the four frameworks.&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width=&#34;31%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Aspect&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;MoM&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;MLE&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;REML&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Bayes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Point estimate of variance components&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MS_A, MS_E solved&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Maximise ℓ&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Maximise ℓ_R&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Posterior mean/median&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Treatment of uncertainty&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Approx / χ²-based&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Fisher information&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Information from ℓ_R&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Full posterior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Interval interpretation&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Confidence&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Confidence&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Confidence&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Credible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Finite-sample bias (variance comps.)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Unbiased (balanced)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Downward biased&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Reduced / unbiased (balanced)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Depends on prior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Distributional assumption&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Moments only&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Full likelihood&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Full likelihood&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Likelihood + prior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Computational cost&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Trivial (closed form)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Low to moderate&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Low to moderate&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;High (MCMC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Compares models w/ different fixed effects?&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;N/A&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes (LRT, AIC)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;No (refit with ML)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes (Bayes factors / IC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Uncertainty for functions of params (e.g. ICC)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Awkward (delta/Satterthwaite)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Delta method&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Delta method&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Exact from draws&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;frequentist-versus-bayesian-interpretation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Frequentist versus Bayesian interpretation&lt;/h2&gt;
&lt;p&gt;The single most important distinction is what an interval &lt;em&gt;means&lt;/em&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;95% confidence interval&lt;/strong&gt; (MoM, MLE, REML) is a statement about the
&lt;em&gt;procedure&lt;/em&gt;: across hypothetical repetitions of the experiment, 95% of the
intervals so constructed would contain the fixed, unknown &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;. The
interval is random; &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; is fixed. You may &lt;strong&gt;not&lt;/strong&gt; say “there is a 95%
probability &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; is in &lt;em&gt;this&lt;/em&gt; interval.”&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;95% credible interval&lt;/strong&gt; (Bayes) is a statement about the &lt;em&gt;parameter&lt;/em&gt;: given
this data, prior, and model, there is a 95% posterior probability that &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;
lies in the interval. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; is random; the interval is fixed once data are
observed. This is the statement people &lt;em&gt;wish&lt;/em&gt; a confidence interval made, and
only the Bayesian framework licenses it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These often &lt;em&gt;look&lt;/em&gt; numerically similar (compare the bars in Figure 9), but
they answer different questions, and they can diverge sharply with small samples,
strong priors, or near a boundary like &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2 = 0\)&lt;/span&gt;.&lt;/p&gt;
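&lt;p&gt;The “statement about the procedure” reading of a confidence interval can be checked by simulation. The sketch below repeats a hypothetical experiment many times (15 Normal observations with a known true mean, values chosen only for illustration) and counts how often the usual t-interval from &lt;code&gt;t.test()&lt;/code&gt; contains that mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
true_mu &amp;lt;- 10
covered &amp;lt;- replicate(2000, {
  x  &amp;lt;- rnorm(15, mean = true_mu, sd = 3)   # one hypothetical experiment
  ci &amp;lt;- t.test(x)$conf.int                  # its 95% confidence interval
  ci[1] &amp;lt;= true_mu &amp;amp;&amp;amp; true_mu &amp;lt;= ci[2]
})
mean(covered)   # long-run share of intervals that contain true_mu&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The share should land close to 0.95. Any single interval either contains &lt;code&gt;true_mu&lt;/code&gt; or it does not; the 95% describes the collection, which is exactly the distinction drawn above.&lt;/p&gt;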
&lt;/div&gt;
&lt;div id=&#34;which-method-when&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which method, when?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Method of Moments&lt;/strong&gt;: when you need a fast answer, a starting value for an
iterative fit, or you are unwilling to assume a full distribution (the GMM
spirit). Excellent for balanced designs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maximum Likelihood&lt;/strong&gt;: your default for point estimation and testing,
especially when you must &lt;strong&gt;compare models with different fixed effects&lt;/strong&gt; (LRTs,
AIC) or when the sample is large enough that finite-sample variance bias is
negligible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;REML&lt;/strong&gt;: the standard for &lt;strong&gt;variance components and mixed models&lt;/strong&gt; whenever
unbiased variance estimates matter and the fixed-effects structure is settled;
prefer it over ML for the final variance estimates, but switch back to ML to
compare fixed-effects specifications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bayesian&lt;/strong&gt;: when the sample is small or the model deeply hierarchical, when
you have genuine prior information, when you need exact uncertainty for derived
quantities (predictions, ICCs, ratios), or when full uncertainty propagation
into a downstream decision is the goal.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;the-takeaway&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;All four methods answer the same question, “what parameter values fit this
data?”, and they differ only in how they ask it. On our balanced design that
difference collapsed to a single arithmetic choice: dividing the between-group
sum of squares by &lt;span class=&#34;math inline&#34;&gt;\(g\)&lt;/span&gt; (ML) or by &lt;span class=&#34;math inline&#34;&gt;\(g-1\)&lt;/span&gt; (MoM, REML, and effectively the
Bayesian posterior). They agreed where the problem is easy, on &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; and
&lt;span class=&#34;math inline&#34;&gt;\(\sigma_e^2\)&lt;/span&gt;, and disagreed exactly where a degree of freedom and the prior have
something to say, on &lt;span class=&#34;math inline&#34;&gt;\(\sigma_a^2\)&lt;/span&gt;. Knowing &lt;em&gt;why&lt;/em&gt; they disagree is what lets you
read any of these outputs with confidence.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
