Notes 2023.09.11

Multiple random variables redux:

Rank correlation coefficient: $r' = 1 - 6 \sum_{i} \frac{d_i^2}{N(N^2-1)}$, where $d_i$ is the difference between the ranks of the $i$-th pair of observations.
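
A quick cross-check of the formula against a library implementation (a sketch assuming SciPy; agreement with the formula is exact only when there are no rank ties):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.5 * rng.normal(size=50)  # correlated pair

# Rank correlation from r' = 1 - 6 * sum(d_i^2) / (N (N^2 - 1))
N = len(x)
d = stats.rankdata(x) - stats.rankdata(y)   # rank differences d_i
r_formula = 1 - 6 * np.sum(d**2) / (N * (N**2 - 1))

r_scipy, _ = stats.spearmanr(x, y)
print(r_formula, r_scipy)  # agree when there are no ties
```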

Joint distribution: $\probp([x_1, x_2] \times [y_1, y_2]) = \iint_{I_x \times I_y} p(x, y) \intd{y} \intd{x}$, where $I_x = [x_1, x_2]$ and $I_y = [y_1, y_2]$.

Marginal distribution: $\probp([x_1, x_2]) = \int_{x_1}^{x_2} \int_{\mathbb{R}} p(x, y) \intd{y} \intd{x}$

e.g. for a bivariate normal $$ p(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-r^2}} \exp \left[-\frac{1}{2(1-r^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{(y-\mu_2)^2}{\sigma_2^2} - 2r\, \frac{(x-\mu_1)(y-\mu_2)}{\sigma_1 \sigma_2} \right) \right]. $$
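
A minimal sketch checking the marginal and the correlation parameter by sampling (the parameter values here are arbitrary illustrations):

```python
import numpy as np

# Draw from a bivariate normal and check that the marginal of x is
# N(mu1, sigma1) and that the sample correlation recovers r.
rng = np.random.default_rng(1)
mu = [1.0, -2.0]
sigma1, sigma2, r = 2.0, 0.5, 0.7
cov = [[sigma1**2, r * sigma1 * sigma2],
       [r * sigma1 * sigma2, sigma2**2]]
xy = rng.multivariate_normal(mu, cov, size=200_000)

print(xy[:, 0].mean(), xy[:, 0].std())        # ~ mu1, sigma1
print(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])  # ~ r
```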

"Independence":

  • Statistical independence: $p(x, y) = p(x)p(y)$, i.e. the probability of joint events factorizes into the product of the individual event probabilities.
  • Linear independence: $r=0$, i.e. the principal axes of the level sets of the pdf are orthogonal (see the sketch after this list).
  • Physical independence: causal statement from domain knowledge.
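
A small sketch of the gap between the first two notions, using the standard $y = x^2$ example: statistically dependent, yet linearly uncorrelated:

```python
import numpy as np

# x symmetric about 0, y a deterministic function of x:
# fully dependent, yet the linear correlation r is ~ 0.
rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = x**2
print(np.corrcoef(x, y)[0, 1])  # ~ 0, despite full dependence
```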

Estimation of pdfs:

  • Histograms: choice of parameters (e.g., bin size)
    • Naively, one can estimate the sensitivity to bin size (even in the eyeball norm!)
    • Exercise: take a large-ish climate dataset (e.g., 300 hPa tropical relative humidity, ~300,000 samples), take a 3,000-datapoint subset, and make histograms (see the sketch after this list).
  • Kernel density estimation
    • Many parameters, many methods.
    • Can get fairly rigorous convergence results under mild assumptions (can these be tested directly on data?)
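
A minimal sketch of both estimators on synthetic stand-in data (assumes NumPy, SciPy, and Matplotlib; the exercise above would use the climate dataset instead):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(size=300_000)      # stand-in for the climate field
subset = rng.choice(data, size=3_000, replace=False)

grid = np.linspace(-4, 4, 400)
fig, ax = plt.subplots()
# Histograms at two bin counts: an eyeball-norm sensitivity check.
for bins in (10, 100):
    ax.hist(subset, bins=bins, density=True, histtype="step",
            label=f"histogram, {bins} bins")
# Gaussian KDE with the default (Scott's rule) bandwidth.
kde = stats.gaussian_kde(subset)
ax.plot(grid, kde(grid), label="Gaussian KDE")
ax.legend()
plt.show()
```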

Correlation and causality:

  • A statistically significant correlation should be analyzed in the context of, e.g., the length of the data record.
    • A simple scatter plot can serve as a gut check (n.b. I use this term instead of "sanity check") for correctness; see the sketch after this list.
    • Do you have a physically plausible interpretation of the correlation? A curious correlation can serve as the start of an inquiry, but it is almost never proof in and of itself.
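
A minimal gut-check sketch along these lines (synthetic data for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Before trusting a correlation coefficient, look at the points:
# outliers and nonlinearity are obvious to the eye but invisible
# in the single number.
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)
print(np.corrcoef(x, y)[0, 1])
plt.scatter(x, y, s=10)
plt.show()
```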

Lecture 03: Statistics

Gamma function:

$$ \Gamma(z) \equiv \int_0^\infty t^{z-1} e^{-t} \intd{t}, \qquad \Gamma(n) = (n-1)! = \prod_{i=1}^{n-1} i \quad \text{for integer } n \geq 1. $$
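
A quick numerical check of the factorial identity (a sketch assuming SciPy):

```python
import math
from scipy import special

# Gamma(n) = (n-1)! for positive integers n.
for n in range(1, 8):
    assert math.isclose(special.gamma(n), math.factorial(n - 1))
```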

Useful distributions:

  • Suppose we have an infinite population $\sim \mathcal{N}(\mu, \sigma)$; then the standard deviation of the average of $N$ independent samples is $\frac{\sigma}{\sqrt{N}}$.
  • Z-statistic (one variable): $z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{N}}}$. This can be analytically written in terms of the gamma function: $f_n(z) = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma\left( \frac{n-1}{2} \right)} \left(1 + z^2 \right)^{-\frac{n}{2}}$, which is crucial for constructing significance tests!
  • Z-statistic (two variables): $z = \frac{(\bar{x}_1 - \mu_1 ) - (\bar{x}_2 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}}$

This works in the sense that $\bar{x}$, viewed as a random variable on $S^N$, is an estimator for $\mu$; its variance diminishes as we take more i.i.d. samples.
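
A simulation sketch of the $\frac{\sigma}{\sqrt{N}}$ scaling (parameter values arbitrary):

```python
import numpy as np

# The std. dev. of the sample mean of N i.i.d. normal draws
# should shrink like sigma / sqrt(N).
rng = np.random.default_rng(5)
mu, sigma, N, trials = 0.0, 3.0, 25, 50_000
means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
print(means.std(), sigma / np.sqrt(N))  # should agree closely
```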

Student's t distribution:

$$ \begin{align*} t &\equiv \frac{\bar{x} - \mu}{\frac{s}{\sqrt{N}}} \\ s &\equiv \sqrt{\frac{1}{N-1} \sum_{i} (x_i-\bar{x})^2} \end{align*} $$

and we get $$ f_{r=N-1}(t) = \frac{\Gamma\left[\frac{r+1}{2} \right]}{\sqrt{r\pi}\, \Gamma\left(\frac{r}{2} \right) \left(1 + \frac{t^2}{r} \right)^{\frac{1}{2}(r+1)}} $$ and $\probe[t_{N-1}] = 0$ (for $N > 2$).
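
A sketch verifying the pdf above against scipy.stats.t:

```python
import numpy as np
from scipy import stats, special

r = 9                      # degrees of freedom, N - 1
t = np.linspace(-4, 4, 9)
# The t pdf written out with gamma functions, as above.
f = (special.gamma((r + 1) / 2)
     / (np.sqrt(r * np.pi) * special.gamma(r / 2)
        * (1 + t**2 / r) ** ((r + 1) / 2)))
print(np.allclose(f, stats.t.pdf(t, df=r)))  # True
```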

χ2\chi^2 distribution:

For a given $\sigma$, $$ \chi_{N-1}^2 = (N-1) \frac{s^2}{\sigma^2} $$ and we find $$ p_{r=N-1}(x) = \frac{x^{\frac{r}{2} - 1}e^{-x/2}}{\Gamma\left(\frac{r}{2} \right) 2^{\frac{r}{2}}} $$ and we conclude $$ \probe[\chi^2_{N-1}] = N-1, \quad \sigma_{N-1} = \sqrt{2(N-1)}. $$
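
A simulation sketch of the scaled sample variance following this law:

```python
import numpy as np

# (N-1) s^2 / sigma^2 follows a chi-square law with N-1 d.o.f.,
# so its mean and std should match N-1 and sqrt(2(N-1)).
rng = np.random.default_rng(6)
sigma, N, trials = 2.0, 10, 100_000
x = rng.normal(0.0, sigma, size=(trials, N))
chi2 = (N - 1) * x.var(axis=1, ddof=1) / sigma**2
print(chi2.mean(), N - 1)                # ~ N-1
print(chi2.std(), np.sqrt(2 * (N - 1)))  # ~ sqrt(2(N-1))
```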

F distribution:

Two independent variables with $\chi^2$ distributions with d.o.f. $n, m$ resp.; then $$ F_{n,m} = \frac{\chi_n^2/n}{\chi_m^2/m} $$ and we once again get an analytic PDF, expectation, and variance. The expectation and variance depend only on $n, m$.
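
A sketch of the corresponding variance-ratio (F) test, assuming SciPy for the CDF:

```python
import numpy as np
from scipy import stats

# Under equal true variances, s1^2 / s2^2 ~ F(N1-1, N2-1).
rng = np.random.default_rng(7)
x1 = rng.normal(0.0, 1.0, size=30)
x2 = rng.normal(0.0, 1.0, size=40)
F = x1.var(ddof=1) / x2.var(ddof=1)
# Two-sided p-value from the F distribution's CDF.
cdf = stats.f.cdf(F, dfn=len(x1) - 1, dfd=len(x2) - 1)
p = 2 * min(cdf, 1 - cdf)
print(F, p)
```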

Broad picture:

  • Z-statistic: tests an observed mean when the population variance is known; increasing the sample size shrinks the variance of the estimate.
  • Student's t-test: tests an observed mean when the std. dev. must itself be estimated from the sample.
  • $\chi^2$: tests an observed (sample) variance against a given population variance.
  • F distribution: tests two observed variances against each other.

Confidence intervals:

Assume NN independent samples drawn from a normal distribution with unknown expectation.

  • Denote the sample mean by $\bar{x}$.
  • What is the interval $I$ in which the true mean $\mu$ is expected to fall with $\probp(\mu \in I) > 0.95$?
  • Take the two points $t_{-0.025}, t_{0.025}$ of the cumulative PDF, defined by:
  • $\probp(t \leq t_{-0.025}(N-1)) = 0.025$ and $\probp(t \leq t_{0.025}(N-1)) = 1 - 0.025$, and therefore $t_{-0.025}(N-1) \leq \frac{\bar{x} - \mu}{\frac{s}{\sqrt{N}}} \leq t_{0.025}(N-1)$, and so $\bar{x} - \frac{s}{\sqrt{N}}\, t_{0.025}(N-1) \leq \mu \leq \bar{x} - \frac{s}{\sqrt{N}}\, t_{-0.025}(N-1)$.

If we want a one-sided estimate, $\probp(\mu \geq \cdot) = 1 - \alpha$, then use $t_\alpha$ and discard $t_{1-\alpha}$ as above.
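
A sketch computing the two-sided interval above with SciPy's t quantiles (synthetic data for illustration):

```python
import numpy as np
from scipy import stats

# 95% two-sided confidence interval for the mean, via t quantiles.
rng = np.random.default_rng(8)
x = rng.normal(10.0, 2.0, size=20)
N, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_hi = stats.t.ppf(0.975, df=N - 1)  # t_{0.025}(N-1); by symmetry t_{-0.025} = -t_hi
lo, hi = xbar - s / np.sqrt(N) * t_hi, xbar + s / np.sqrt(N) * t_hi
print(lo, hi)  # covers the true mean 10.0 in ~95% of repetitions
```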

Hypothesis testing:

  • State significance level a priori.
  • State the null hypothesis and the alternative hypothesis.
  • Decide on a two-sided or one-sided test.
  • Find the appropriate statistic to use.
  • Calculate the statistic.
  • Evaluate the result and accept/reject the null hypothesis.

There is always a chance that you reject a true null hypothesis (Type I error) or accept a false one (Type II error).
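
A sketch of the full workflow using a two-sided one-sample t-test via scipy.stats.ttest_1samp (synthetic data; the true mean is deliberately nonzero):

```python
import numpy as np
from scipy import stats

alpha = 0.05                                # significance level, fixed a priori
rng = np.random.default_rng(9)
x = rng.normal(0.5, 1.0, size=30)           # data (true mean 0.5)
# H0: mu = 0, two-sided alternative.
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```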
