Properties of the least squares estimator

\begin{matrix} (1) & \begin{cases} Y = X^{T} β + e \\ E (e ∣ X) = 0 \end{cases} \\ (2) & \begin{cases} E (Y^{2}) < \infty \\ E (‖ X ‖^{2}) < \infty \\ E (X X^{T}) ≻ O \end{cases} \end{matrix}

The least squares estimator, written in matrix form, is

\hat{β} \equiv {(X^{T} X)}^{- 1} (X^{T} Y)
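As a minimal sketch of the formula above, the estimator can be computed with NumPy on simulated data. The sample size, coefficients, and error distribution here are illustrative assumptions and do not come from the text.

```python
import numpy as np

# Hypothetical simulated data; sizes and coefficients are illustrative only.
rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # n x k design matrix
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'Y, obtained by solving the normal equations
# instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's general least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the normal equations with `np.linalg.solve` mirrors the formula directly; for ill-conditioned designs `np.linalg.lstsq` is usually the numerically safer choice.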

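All results below assume that the sample is independent and identically distributed (i.i.d.).

Conditional expectation

The sample model has the same form as the population model: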

Y_{i} = X_{i}^{T} β + e_{i}

Taking conditional expectations gives

E [Y_{i} ∣ X] = X_{i}^{T} β + E [e_{i} ∣ X]

Stacking over observations,

E [Y ∣ X] = [\begin{matrix} ⋮ \\ E [Y_{i} ∣ X] \\ ⋮ \end{matrix}] = [\begin{matrix} ⋮ \\ X_{i}^{T} β + E [e_{i} ∣ X] \\ ⋮ \end{matrix}] = X β + E [e ∣ X]

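The conditional expectation of the least squares estimator is

\begin{aligned} E [\hat{β} ∣ X] & = E [(X^{T} X)^{- 1} (X^{T} Y) ∣ X] \\ & = (X^{T} X)^{- 1} X^{T} E [Y ∣ X] \\ & = (X^{T} X)^{- 1} X^{T} (X β + E [e ∣ X]) \\ & = β + (X^{T} X)^{- 1} X^{T} E [e ∣ X] \end{aligned}

When $E (e | X) = 0$ (the assumption of the linear CEF model), the second term vanishes and

E [\hat{β} ∣ X] = β

This property is called unbiasedness.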
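Conditional unbiasedness can be checked with a small Monte Carlo sketch: hold $X$ fixed, redraw the errors many times, and average the resulting estimates. The design, the error scale `sd`, and the number of replications are assumptions made for illustration. The errors are deliberately given a variance that depends on $X$, since only $E (e | X) = 0$ is needed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 20_000
beta = np.array([1.0, 2.0])

# Draw X once and hold it fixed: everything below is conditional on X.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Errors with E[e | X] = 0 but a standard deviation that depends on X
# (an illustrative choice, not taken from the text).
sd = 0.5 + np.abs(x)
E = rng.normal(size=(reps, n)) * sd   # each row is one draw of e given X

Ys = X @ beta + E                     # reps x n, one simulated sample per row
A = np.linalg.solve(X.T @ X, X.T)     # k x n matrix (X'X)^{-1} X'
beta_hats = Ys @ A.T                  # reps x k, one estimate per row

# The average over replications should be close to beta.
print(beta_hats.mean(axis=0))
```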

If $(X, e)$ have a joint normal distribution^[1], then by the law of iterated expectations

E (\hat{β}) = E [E [\hat{β} ∣ X]] = E [β] = β

Conditional variance

The conditional variance of the least squares estimator is

\begin{aligned} V a r [\hat{β} ∣ X] & = V a r [(X^{T} X)^{- 1} (X^{T} Y) ∣ X] \\ & = (X^{T} X)^{- 1} X^{T} V a r [Y ∣ X] X (X^{T} X)^{- 1} \\ & = (X^{T} X)^{- 1} X^{T} V a r [X β + e ∣ X] X (X^{T} X)^{- 1} \\ & = (X^{T} X)^{- 1} X^{T} V a r [e ∣ X] X (X^{T} X)^{- 1} \end{aligned}

For convenience, define

Ω \equiv V a r (e ∣ X)

When $E (e | X) = 0$ (the assumption of the linear CEF model),

Ω = E [e e^{T} ∣ X] - E [e ∣ X] E [e ∣ X]^{T} = E [e e^{T} ∣ X] = [\begin{matrix} σ_{11}^{2} & \dots & σ_{1 n}^{2} \\ ⋮ & ⋱ & ⋮ \\ σ_{n 1}^{2} & \dots & σ_{n n}^{2} \end{matrix}]

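Here the diagonal elements are the variances of the sample errors and the off-diagonal elements are the covariances between sample errors.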

\begin{aligned} σ_{i i}^{2} & = E [e_{i}^{2} | X] = E [e_{i}^{2} | X] - E [e_{i} ∣ X]^{2} = V a r (e_{i} ∣ X) \\ σ_{i j}^{2} & = E [e_{i} e_{j} ∣ X] = E [e_{i} e_{j} ∣ X] - E [e_{i} ∣ X] E [e_{j} ∣ X] = C o v (e_{i}, e_{j} ∣ X) \end{aligned}

For this reason, $Ω$ is also called the variance-covariance matrix of the sample errors.
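The sandwich form $(X^{T} X)^{- 1} X^{T} Ω X (X^{T} X)^{- 1}$ can be compared with a simulated covariance of $\hat{β}$. The sketch below assumes a diagonal $Ω$ with hypothetical variances `sigma2` that depend on $X$; these choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 40, 20_000
beta = np.array([1.0, 2.0])

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # held fixed across replications
sigma2 = (0.5 + np.abs(x)) ** 2        # hypothetical Var(e_i | X)
Omega = np.diag(sigma2)                # uncorrelated errors, unequal variances

A = np.linalg.solve(X.T @ X, X.T)      # (X'X)^{-1} X'
V_sandwich = A @ Omega @ A.T           # (X'X)^{-1} X' Omega X (X'X)^{-1}

# Simulate beta_hat conditional on X and estimate its covariance matrix.
E = rng.normal(size=(reps, n)) * np.sqrt(sigma2)
beta_hats = (X @ beta + E) @ A.T
V_mc = np.cov(beta_hats, rowvar=False)

print(V_sandwich)
print(V_mc)
```

The two matrices should agree up to simulation noise, which shrinks as `reps` grows.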

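Because $Ω$ contains as many as $n \times n$ unknown entries ($n (n + 1) / 2$ distinct ones, by symmetry), several assumptions are commonly used to reduce the number of unknown parameters:

Diagonal elements
- Homogeneity: $σ_{i i}^{2} = σ^{2}$
- Heterogeneity: $σ_{i i}^{2} = σ_{i}^{2}$
Off-diagonal elements ($i \neq j$)
- Uncorrelated: $σ_{i j}^{2} = 0$
- Correlated: $σ_{i j}^{2} \neq 0$
  - For grouped data, if $C o v (e_{i, g}, e_{j, g}) \neq 0$, the errors in group $g$ are said to be cluster-correlated
  - For time-series data, if $e_{t} = ρ e_{t - 1} + ε_{t}$ with $ρ \neq 0$, the series is said to be (first-order) autocorrelated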

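Under i.i.d. sampling the off-diagonal elements are automatically zero, so two combinations of assumptions are used most often:

Homogeneity + uncorrelatedness = homoskedasticity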

Ω = [\begin{matrix} σ^{2} & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & σ^{2} \end{matrix}] = σ^{2} I_{n}

Heterogeneity + uncorrelatedness = heteroskedasticity

Ω = [\begin{matrix} σ_{1}^{2} & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & σ_{n}^{2} \end{matrix}]

For example, under homoskedasticity the conditional variance of the least squares estimator is

V a r [\hat{β} ∣ X] = (X^{T} X)^{- 1} σ^{2}
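A short numerical sketch of this simplification: with $Ω = σ^{2} I_{n}$ the sandwich form collapses to $(X^{T} X)^{- 1} σ^{2}$, while with unequal diagonal entries it generally does not. The design matrix, `sigma2`, and the heteroskedastic variances are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                      # (X'X)^{-1} X'


def sandwich(Omega):
    """Conditional variance (X'X)^{-1} X' Omega X (X'X)^{-1}."""
    return A @ Omega @ A.T


sigma2 = 1.5                                            # hypothetical common variance
Omega_homo = sigma2 * np.eye(n)                         # homoskedastic case
Omega_hetero = np.diag(rng.uniform(0.5, 3.0, size=n))   # heteroskedastic case

# Homoskedastic case: the sandwich equals sigma^2 (X'X)^{-1}.
print(np.allclose(sandwich(Omega_homo), sigma2 * XtX_inv))

# Heteroskedastic case: plugging the average variance into the simple
# formula does not, in general, reproduce the sandwich.
avg_var = np.trace(Omega_hetero) / n
print(np.allclose(sandwich(Omega_hetero), avg_var * XtX_inv))
```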

In fact, this is the smallest variance attainable within a class of estimators of $β$, namely the linear unbiased estimators.

Gauss-Markov Theorem. Take the homoskedastic linear regression model. If $\tilde{β}$ is a linear unbiased estimator of $β$, then

V a r [\tilde{β} ∣ X] \geq (X^{T} X)^{- 1} σ^{2}

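That is, among all linear unbiased estimators the least squares estimator has the smallest conditional variance, where the inequality is understood in the positive semidefinite sense.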
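The theorem can be illustrated numerically with one competing linear unbiased estimator. The sketch below uses a weighted least squares estimator with arbitrary positive weights `w`; this competitor is an assumption chosen for illustration, not something singled out in the text. Under homoskedasticity, an estimator $A Y$ with $A X = I$ has conditional variance $σ^{2} A A^{T}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 60, 1.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# OLS: beta_hat = A_ols @ Y with A_ols = (X'X)^{-1} X'.
A_ols = np.linalg.solve(X.T @ X, X.T)

# Competitor: beta_tilde = A_alt @ Y with A_alt = (X'WX)^{-1} X'W,
# where W = diag(w) holds arbitrary positive weights (illustrative).
w = rng.uniform(0.2, 5.0, size=n)
XtW = X.T * w                          # k x n, equals X' W
A_alt = np.linalg.solve(XtW @ X, XtW)

# Both are linear in Y and unbiased, because A X = I in each case.
print(np.allclose(A_ols @ X, np.eye(k)), np.allclose(A_alt @ X, np.eye(k)))

# Conditional variances under Var(e | X) = sigma^2 I_n.
V_ols = sigma2 * A_ols @ A_ols.T
V_alt = sigma2 * A_alt @ A_alt.T

# Gauss-Markov: V_alt - V_ols is positive semidefinite,
# so its eigenvalues are nonnegative up to rounding error.
diff = V_alt - V_ols
eigs = np.linalg.eigvalsh((diff + diff.T) / 2)
print(eigs)
```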

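Note that even the most convenient assumption, homoskedasticity, only reduces the unknown parameters of $Ω$ to the single $σ^{2}$. How to estimate the conditional variance from sample data is still unresolved; for further discussion see Estimation of the regression error parameters.

Classical linear model assumptions

At this point we have gone through the whole life cycle, from building the population linear regression model to obtaining the sample estimator and its properties. The relevant assumptions are summarized below.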

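| No. | Assumption | Content |
| --- | --- | --- |
| ① | Linear model | $Y = X β + e$ |
| ② | Mean independence | $E [e ∣ X] = 0$ |
| ③ | Random sampling | The sample is i.i.d. |
| ④ | No perfect collinearity among the regressors | $E (X X^{T})$ is positive definite |
| ⑤ | Homoskedastic errors | $V a r [e ∣ X] = σ^{2} I_{n}$ |
| ⑥ | Normally distributed errors | $e ∣ X \sim N (0, σ^{2} I_{n})$ |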

Tip

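①-⑤ are called the Gauss-Markov assumptions
①-⑥ are called the classical linear model (CLM) assumptions

①-④ are enough for unbiasedness. ⑤ yields an ideal conditional variance for the estimator but is unrealistic, which is why the various robust standard errors are needed. ⑥ is used for statistical inference, but assuming it directly is unconvincing; it is better to rely on the central limit theorem and asymptotic theory.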

flowchart LR
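    inference[Statistical inference]-->se[Standard errors]
    inference-->normal[Normal distribution]
    se--homoskedasticity-->simple[Simple standard errors]
    se--heteroskedasticity-->robust[Robust standard errors]
    normal-->bold[Bold assumption]
    normal-->asym[Asymptotic theory]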
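To make the two standard-error branches concrete, the sketch below computes a simple (homoskedastic) standard error and one common robust alternative on simulated heteroskedastic data. The HC0 (White) form and the $n - k$ degrees-of-freedom correction are assumptions chosen for illustration; the text does not specify which estimators to use.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# Hypothetical heteroskedastic errors: the spread grows with |x|.
e = rng.normal(size=n) * (0.5 + np.abs(x))
Y = X @ np.array([1.0, 2.0]) + e

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat

# Simple standard errors: s^2 (X'X)^{-1}, valid under homoskedasticity.
s2 = resid @ resid / (n - k)
V_simple = s2 * XtX_inv

# Robust (HC0) standard errors: the sandwich with diag(resid_i^2)
# standing in for the unknown Omega.
meat = (X * resid[:, None] ** 2).T @ X
V_hc0 = XtX_inv @ meat @ XtX_inv

se_simple = np.sqrt(np.diag(V_simple))
se_hc0 = np.sqrt(np.diag(V_hc0))
print(se_simple)
print(se_hc0)
```

In this design the error variance rises with `|x|`, so the simple formula tends to understate the uncertainty of the slope.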

Quote

The CLM assumptions are very strong, and a primary focus in theoretical and applied econometrics has been to conduct inference using OLS in a variety of settings – cross-sectional data, time series data, panel data, and data with a spatial structure – while imposing few assumptions. It is very difficult to get anywhere without relying on asymptotics. Therefore, we replace the CLM assumptions and rely on application of the law of large numbers and central limit theorem. (Wooldridge, 2023)

Asymptotic properties

[1] This is a sufficient but not necessary condition; if $X$ has a discrete distribution, the expectation and variance of $\hat{β}$ may not exist.