Notes 2023.09.11

Multiple random variables redux:

Rank correlation coefficient: $r' = 1 - 6 \sum_{i} \frac{d_i^2}{N(N^2-1)}$, where $d_i$ is the difference between the ranks of the $i$-th pair of observations.
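
A quick cross-check of the formula against a library implementation (a sketch assuming SciPy; agreement with the formula is exact only when there are no rank ties):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.5 * rng.normal(size=50)  # correlated pair

# Rank correlation from r' = 1 - 6 * sum(d_i^2) / (N (N^2 - 1))
N = len(x)
d = stats.rankdata(x) - stats.rankdata(y)   # rank differences d_i
r_formula = 1 - 6 * np.sum(d**2) / (N * (N**2 - 1))

r_scipy, _ = stats.spearmanr(x, y)
print(r_formula, r_scipy)  # agree when there are no ties
```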

Joint distribution: $\probp([x_1, x_2] \times [y_1, y_2]) = \iint_{I_x \times I_y} p(x, y) \intd{y} \intd{x}$, where $I_x = [x_1, x_2]$ and $I_y = [y_1, y_2]$.

Marginal distribution: $\probp([x_1, x_2]) = \int_{x_1}^{x_2} \int_{\mathbb{R}} p(x, y) \intd{y} \intd{x}$

e.g. for a bivariate normal $$ p(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-r^2}} \exp \left[-\frac{1}{2(1-r^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{(y-\mu_2)^2}{\sigma_2^2} - 2r\, \frac{(x-\mu_1)(y-\mu_2)}{\sigma_1 \sigma_2} \right) \right]. $$
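
A minimal sketch checking the marginal and the correlation parameter by sampling (the parameter values here are arbitrary illustrations):

```python
import numpy as np

# Draw from a bivariate normal and check that the marginal of x is
# N(mu1, sigma1) and that the sample correlation recovers r.
rng = np.random.default_rng(1)
mu = [1.0, -2.0]
sigma1, sigma2, r = 2.0, 0.5, 0.7
cov = [[sigma1**2, r * sigma1 * sigma2],
       [r * sigma1 * sigma2, sigma2**2]]
xy = rng.multivariate_normal(mu, cov, size=200_000)

print(xy[:, 0].mean(), xy[:, 0].std())        # ~ mu1, sigma1
print(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])  # ~ r
```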

"Independence":

  • Statistical independence: $p(x, y) = p(x)p(y)$, i.e. the probability of joint events factorizes into the product of the individual event probabilities.
  • Linear independence: $r=0$, i.e. the principal axes of the level sets of the pdf are orthogonal (see the sketch after this list).
  • Physical independence: causal statement from domain knowledge.
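
A small sketch of the gap between the first two notions, using the standard $y = x^2$ example: statistically dependent, yet linearly uncorrelated:

```python
import numpy as np

# x symmetric about 0, y a deterministic function of x:
# fully dependent, yet the linear correlation r is ~ 0.
rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = x**2
print(np.corrcoef(x, y)[0, 1])  # ~ 0, despite full dependence
```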

Estimation of pdfs:

  • Histograms: choice of parameters (e.g., bin size)
    • Naively, one can estimate the sensitivity to bin size (even in the eyeball norm!)
    • Exercise: take a large-ish climate dataset (e.g., 300 hPa tropical relative humidity, ~300,000 samples), take a 3,000-datapoint subset, and make histograms (see the sketch after this list).
  • Kernel density estimation
    • Many parameters, many methods.
    • Can get fairly rigorous convergence results under mild assumptions (can these be tested directly on data?)
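
A minimal sketch of both estimators on synthetic stand-in data (assumes NumPy, SciPy, and Matplotlib; the exercise above would use the climate dataset instead):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(size=300_000)      # stand-in for the climate field
subset = rng.choice(data, size=3_000, replace=False)

grid = np.linspace(-4, 4, 400)
fig, ax = plt.subplots()
# Histograms at two bin counts: an eyeball-norm sensitivity check.
for bins in (10, 100):
    ax.hist(subset, bins=bins, density=True, histtype="step",
            label=f"histogram, {bins} bins")
# Gaussian KDE with the default (Scott's rule) bandwidth.
kde = stats.gaussian_kde(subset)
ax.plot(grid, kde(grid), label="Gaussian KDE")
ax.legend()
plt.show()
```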

Correlation and causality:

  • A statistically significant correlation should be analyzed in the context of, e.g., the length of the data record.
    • A simple scatter plot can serve as a gut check (n.b. I use this term instead of "sanity check") for correctness; see the sketch after this list.
    • Do you have a physically plausible interpretation of the correlation? A curious correlation can serve as the start of an inquiry, but it is almost never proof in and of itself.
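
A minimal gut-check sketch along these lines (synthetic data for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Before trusting a correlation coefficient, look at the points:
# outliers and nonlinearity are obvious to the eye but invisible
# in the single number.
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)
print(np.corrcoef(x, y)[0, 1])
plt.scatter(x, y, s=10)
plt.show()
```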

Lecture 03: Statistics

Gamma function:

$$ \Gamma(z) \equiv \int_0^\infty t^{z-1} e^{-t} \intd{t}, \qquad \Gamma(n) = (n-1)! = \prod_{i=1}^{n-1} i \quad \text{for integer } n \geq 1. $$
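
A quick numerical check of the factorial identity (a sketch assuming SciPy):

```python
import math
from scipy import special

# Gamma(n) = (n-1)! for positive integers n.
for n in range(1, 8):
    assert math.isclose(special.gamma(n), math.factorial(n - 1))
```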

Useful distributions:

  • Suppose we have an infinite population $\sim \mathcal{N}(\mu, \sigma)$; then the standard deviation of the average of $N$ independent samples is $\frac{\sigma}{\sqrt{N}}$.
  • Z-statistic (one variable): $z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{N}}}$. This can be analytically written in terms of the gamma function: $f_n(z) = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma\left( \frac{n-1}{2} \right)} \left(1 + z^2 \right)^{-\frac{n}{2}}$, which is crucial for constructing significance tests!
  • Z-statistic (two variables): $z = \frac{(\bar{x}_1 - \mu_1 ) - (\bar{x}_2 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}}$

This works in the sense that $\bar{x}$, viewed as a random variable on $S^N$, is an estimator for $\mu$; its variance diminishes as we take more i.i.d. samples.
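
A simulation sketch of the $\frac{\sigma}{\sqrt{N}}$ scaling (parameter values arbitrary):

```python
import numpy as np

# The std. dev. of the sample mean of N i.i.d. normal draws
# should shrink like sigma / sqrt(N).
rng = np.random.default_rng(5)
mu, sigma, N, trials = 0.0, 3.0, 25, 50_000
means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
print(means.std(), sigma / np.sqrt(N))  # should agree closely
```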

Student's t distribution:

$$ \begin{align*} t &\equiv \frac{\bar{x} - \mu}{\frac{s}{\sqrt{N}}} \\ s &\equiv \sqrt{\frac{1}{N-1} \sum_{i} (x_i-\bar{x})^2} \end{align*} $$

and we get $$ f_{r=N-1}(t) = \frac{\Gamma\left[\frac{r+1}{2} \right]}{\sqrt{r\pi}\, \Gamma\left(\frac{r}{2} \right) \left(1 + \frac{t^2}{r} \right)^{\frac{1}{2}(r+1)}} $$ and $\probe[t_{N-1}] = 0$ (for $N > 2$).
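
A sketch verifying the pdf above against scipy.stats.t:

```python
import numpy as np
from scipy import stats, special

r = 9                      # degrees of freedom, N - 1
t = np.linspace(-4, 4, 9)
# The t pdf written out with gamma functions, as above.
f = (special.gamma((r + 1) / 2)
     / (np.sqrt(r * np.pi) * special.gamma(r / 2)
        * (1 + t**2 / r) ** ((r + 1) / 2)))
print(np.allclose(f, stats.t.pdf(t, df=r)))  # True
```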

χ2\chi^2 distribution:

For a given $\sigma$, $$ \chi_{N-1}^2 = (N-1) \frac{s^2}{\sigma^2} $$ and we find $$ p_{r=N-1}(x) = \frac{x^{\frac{r}{2} - 1}e^{-x/2}}{\Gamma\left(\frac{r}{2} \right) 2^{\frac{r}{2}}} $$ and we conclude $$ \probe[\chi^2_{N-1}] = N-1, \quad \sigma_{N-1} = \sqrt{2(N-1)}. $$
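
A simulation sketch of the scaled sample variance following this law:

```python
import numpy as np

# (N-1) s^2 / sigma^2 follows a chi-square law with N-1 d.o.f.,
# so its mean and std should match N-1 and sqrt(2(N-1)).
rng = np.random.default_rng(6)
sigma, N, trials = 2.0, 10, 100_000
x = rng.normal(0.0, sigma, size=(trials, N))
chi2 = (N - 1) * x.var(axis=1, ddof=1) / sigma**2
print(chi2.mean(), N - 1)                # ~ N-1
print(chi2.std(), np.sqrt(2 * (N - 1)))  # ~ sqrt(2(N-1))
```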

F distribution:

Two independent variables with $\chi^2$ distributions with d.o.f. $n, m$ resp.; then $$ F_{n,m} = \frac{\chi_n^2/n}{\chi_m^2/m} $$ and we once again get an analytic PDF, expectation, and variance. The expectation and variance depend only on $n, m$.
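
A sketch of the corresponding variance-ratio (F) test, assuming SciPy for the CDF:

```python
import numpy as np
from scipy import stats

# Under equal true variances, s1^2 / s2^2 ~ F(N1-1, N2-1).
rng = np.random.default_rng(7)
x1 = rng.normal(0.0, 1.0, size=30)
x2 = rng.normal(0.0, 1.0, size=40)
F = x1.var(ddof=1) / x2.var(ddof=1)
# Two-sided p-value from the F distribution's CDF.
cdf = stats.f.cdf(F, dfn=len(x1) - 1, dfd=len(x2) - 1)
p = 2 * min(cdf, 1 - cdf)
print(F, p)
```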

Broad picture:

  • Z-statistic: tests an observed mean when the population variance is known; increasing the sample size shrinks the variance of the estimate.
  • Student's t-test: tests an observed mean when the std. dev. must itself be estimated from the sample.
  • $\chi^2$: tests an observed (sample) variance against a given population variance.
  • F distribution: tests two observed variances against each other.

Confidence intervals:

Assume NN independent samples drawn from a normal distribution with unknown expectation.

  • Denote the sample mean by $\bar{x}$.
  • What is the interval $I$ in which the true mean $\mu$ is expected to fall with $\probp(\mu \in I) > 0.95$?
  • Take the two points $t_{-0.025}, t_{0.025}$ of the cumulative PDF, defined by:
  • $\probp(t \leq t_{-0.025}(N-1)) = 0.025$ and $\probp(t \leq t_{0.025}(N-1)) = 1 - 0.025$, and therefore $t_{-0.025}(N-1) \leq \frac{\bar{x} - \mu}{\frac{s}{\sqrt{N}}} \leq t_{0.025}(N-1)$, and so $\bar{x} - \frac{s}{\sqrt{N}}\, t_{0.025}(N-1) \leq \mu \leq \bar{x} - \frac{s}{\sqrt{N}}\, t_{-0.025}(N-1)$.

If we want a one-sided estimate, $\probp(\mu \geq \cdot) = 1 - \alpha$, then use $t_\alpha$ and discard $t_{1-\alpha}$ as above.
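
A sketch computing the two-sided interval above with SciPy's t quantiles (synthetic data for illustration):

```python
import numpy as np
from scipy import stats

# 95% two-sided confidence interval for the mean, via t quantiles.
rng = np.random.default_rng(8)
x = rng.normal(10.0, 2.0, size=20)
N, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_hi = stats.t.ppf(0.975, df=N - 1)  # t_{0.025}(N-1); by symmetry t_{-0.025} = -t_hi
lo, hi = xbar - s / np.sqrt(N) * t_hi, xbar + s / np.sqrt(N) * t_hi
print(lo, hi)  # covers the true mean 10.0 in ~95% of repetitions
```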

Hypothesis testing:

  • State significance level a priori.
  • State the null hypothesis and the alternative hypothesis.
  • Decide on a two-sided or one-sided test.
  • Find the appropriate statistic to use.
  • Calculate the statistic.
  • Evaluate the result and accept/reject the null hypothesis.

There is always a chance that you reject a true null hypothesis (Type I error) or accept a false one (Type II error).
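
A sketch of the full workflow using a two-sided one-sample t-test via scipy.stats.ttest_1samp (synthetic data; the true mean is deliberately nonzero):

```python
import numpy as np
from scipy import stats

alpha = 0.05                                # significance level, fixed a priori
rng = np.random.default_rng(9)
x = rng.normal(0.5, 1.0, size=30)           # data (true mean 0.5)
# H0: mu = 0, two-sided alternative.
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```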
