e.g. for a bivariate normal,
$$p(x,y)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\exp\left[-\frac{1}{2(1-r^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}+\frac{(y-\mu_2)^2}{\sigma_2^2}-\frac{2r(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}\right)\right].$$
"Independence":
Statistical independence: $p(x,y)=p(x)\,p(y)$, i.e. the joint probability of the events factorizes into the marginals.
Linear independence: $r=0$, i.e. the principal axes of the level sets of the pdf align with the coordinate axes.
Physical independence: causal statement from domain knowledge.
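A minimal numerical sketch of the distinction (assuming numpy; the value of $r$ is arbitrary): sample a correlated bivariate normal and recover $r$. Note that for the Gaussian, $r=0$ makes the pdf above factorize, so linear and statistical independence coincide; this is special to the normal, not true of distributions in general.
```python
import numpy as np

rng = np.random.default_rng(7)
r = 0.6
cov = [[1.0, r], [r, 1.0]]                    # sigma1 = sigma2 = 1
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
print(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])  # ~0.6
# With r = 0 the Gaussian pdf factorizes, so r = 0 implies full
# statistical independence -- a special property of the normal.
```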
Estimation of pdfs:
Histograms: choice of parameters (e.g., bin size)
Naively, one can estimate sensitivity to bin size (even in the eyeball norm!)
Exercise: take a large-ish climate dataset (e.g., 300 hPa tropical relative humidity, ~300,000 samples). Take a 3,000-datapoint subset. Make histograms with several bin sizes (see the sketch after this list).
Kernel density estimation
Many parameters, many methods.
Can get fairly rigorous convergence results under mild assumptions (can these be tested directly on data?)
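A minimal sketch of the bin-size exercise with a KDE overlay, assuming numpy/scipy/matplotlib; a synthetic beta-distributed sample stands in for the actual humidity field, which is not reproduced here:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for the ~300,000-sample humidity field: a skewed synthetic sample.
full = rng.beta(2.0, 5.0, size=300_000)
subset = rng.choice(full, size=3_000, replace=False)

xs = np.linspace(0.0, 1.0, 400)
kde = gaussian_kde(subset)  # bandwidth is its own tunable parameter

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, bins in zip(axes, [10, 50, 250]):
    ax.hist(subset, bins=bins, density=True, alpha=0.6)
    ax.plot(xs, kde(xs), "k-")  # KDE overlay for comparison
    ax.set_title(f"{bins} bins")
plt.show()
```
Redrawing the 3,000-point subset a few times gives an eyeball-norm estimate of how much each bin choice fluctuates.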
Correlation and causality:
A statistically significant correlation should be analyzed in the context of, e.g., the length of the data record.
A simple scatter plot can serve as a gut check (n.b. I use this instead of "sanity check") for correctness; see the sketch below.
Do you have a physically plausible interpretation of correlation? A curious correlation can serve as a start of inquiry, but it is almost never proof in and of itself.
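As a concrete gut check, plot the scatter next to the correlation coefficient (a minimal sketch with synthetic data, assuming numpy and matplotlib):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)  # weak linear coupling plus noise
r = np.corrcoef(x, y)[0, 1]

plt.scatter(x, y, s=10)
plt.title(f"Pearson r = {r:.2f}")
plt.show()
```
A single $r$ value can hide outliers or nonlinearity that the scatter plot makes obvious.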
Lecture 03: Statistics
Gamma function:
$$\Gamma(z)\equiv\int_0^\infty t^{z-1}e^{-t}\,dt,\qquad \Gamma(n)=(n-1)!=\prod_{i=1}^{n-1}i$$
Useful distributions:
Suppose we have an infinite population $\sim N(\mu,\sigma)$;
then the standard deviation of the average of $N$ independent samples is $\sigma/\sqrt{N}$.
Z-statistics (one variable)
$$z=\frac{\bar{x}-\mu}{\sigma_{\bar{x}}}=\frac{\bar{x}-\mu}{\sigma/\sqrt{N}}$$
This can be analytically written in terms of the gamma function!
$$f_n(z)=\frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{n-1}{2}\right)}\left(1+z^2\right)^{-n/2}$$
which is crucial for constructing significance tests!
This works in the sense that $\bar{x}$ is a random variable on $S^N$ whose distribution makes it an estimator for $\mu$.
The variance of this estimator diminishes as we take more i.i.d. samples.
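A quick numerical check of the $\sigma/\sqrt{N}$ scaling (a sketch assuming numpy; the values of $\mu$, $\sigma$, $N$ are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N = 5.0, 2.0, 25
# 10,000 replications of "average N i.i.d. samples":
means = rng.normal(mu, sigma, size=(10_000, N)).mean(axis=1)
print(means.std(), sigma / np.sqrt(N))  # both approximately 0.4
```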
Student's t distribution:
$$t\equiv\frac{\bar{x}-\mu}{s/\sqrt{N}},\qquad s^2\equiv\frac{1}{N-1}\sum_i(x_i-\bar{x})^2$$
and we get
$$f_{r=N-1}(t)=\frac{\Gamma\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\,\Gamma\left(\frac{r}{2}\right)}\left(1+\frac{t^2}{r}\right)^{-\frac{1}{2}(r+1)}$$
and
$$E[t_{N-1}]=0$$
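A sketch verifying this numerically, assuming scipy: the $t$ statistic computed from repeated samples should match the analytic $t_{N-1}$ quantiles.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, N = 0.0, 1.0, 8
samples = rng.normal(mu, sigma, size=(50_000, N))
t = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(N))
# Empirical quantiles should match the analytic t_{N-1} distribution:
print(np.quantile(t, [0.025, 0.975]))
print(stats.t.ppf([0.025, 0.975], df=N - 1))
```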
$\chi^2$ distribution:
For a given $\sigma$,
$$\chi^2_{N-1}=(N-1)\frac{s^2}{\sigma^2}$$
and we find
$$p_{r=N-1}(x)=\frac{1}{2^{r/2}\,\Gamma\left(\frac{r}{2}\right)}\,x^{\frac{r}{2}-1}e^{-x/2}$$
and we conclude
$$E[\chi^2_{N-1}]=N-1,\qquad \sigma_{N-1}=\sqrt{2(N-1)}$$
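A sketch of a two-sided variance test built from this statistic, assuming scipy; `sigma0` and the sample are illustrative:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=30)
sigma0 = 2.0                                     # hypothesized sigma
chi2 = (len(x) - 1) * x.var(ddof=1) / sigma0**2  # the statistic above
df = len(x) - 1
# Two-sided p-value: double the smaller tail probability.
p = 2 * min(stats.chi2.cdf(chi2, df), stats.chi2.sf(chi2, df))
print(chi2, p)
```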
F distribution:
Given two independent variables with $\chi^2$ distributions with $n, m$ degrees of freedom respectively,
$$F_{n,m}=\frac{\chi^2_n/n}{\chi^2_m/m}$$
and we once again get an analytic PDF, expectation, and variance. Expectation and variance depend only on n,m.
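Correspondingly, a sketch of a two-sided F-test of two sample variances (assuming scipy; the samples are synthetic):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, size=20)
b = rng.normal(0.0, 1.5, size=25)
F = a.var(ddof=1) / b.var(ddof=1)  # ratio of sample variances
dfn, dfd = len(a) - 1, len(b) - 1
p = 2 * min(stats.f.cdf(F, dfn, dfd), stats.f.sf(F, dfn, dfd))  # two-sided
print(F, p)
```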
Broad picture:
Z-statistic: tests an observed mean when $\sigma$ is known; increasing the sample size shrinks the variance of the estimate.
Student's t-test: tests an observed mean when the std. dev. must be estimated from the sample.
$\chi^2$: tests an observed sample variance against a hypothesized variance.
F distribution: tests two observed variances against each other.
Confidence intervals:
Assume $N$ independent samples drawn from a normal distribution with unknown expectation.
Denote the sample mean as $\bar{x}$.
What is the interval $I$ in which the true mean $\mu$ is expected to fall with $P(\mu\in I)\geq 0.95$?
Pick the two points $t_{-0.025}, t_{0.025}$ of the cumulative distribution function:
$$P(t_{-0.025}(N-1))=0.025,\qquad P(t_{0.025}(N-1))=1-0.025$$
and therefore
$$t_{-0.025}(N-1)\leq\frac{\bar{x}-\mu}{s/\sqrt{N}}\leq t_{0.025}(N-1)$$
and so
$$\bar{x}-\frac{s}{\sqrt{N}}\,t_{0.025}(N-1)\leq\mu\leq\bar{x}-\frac{s}{\sqrt{N}}\,t_{-0.025}(N-1)$$
If we want a one-sided estimate, i.e. $P(\mu\geq\cdot)=1-\alpha$, then use $t_\alpha$ and discard $t_{1-\alpha}$ in the construction above.
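A minimal sketch of the two-sided 95% interval, assuming scipy (`stats.t.ppf` gives the quantile written $t_{0.025}(N-1)$ above; by symmetry the lower bound is its negative):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(5.0, 2.0, size=25)
N = len(x)
t_hi = stats.t.ppf(0.975, df=N - 1)      # the notes' t_{0.025}(N-1)
half = t_hi * x.std(ddof=1) / np.sqrt(N)
print(x.mean() - half, x.mean() + half)  # 95% interval for mu
```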
Hypothesis testing:
State the significance level a priori.
State the null hypothesis and the alternative hypothesis.
Decide on a two-sided or one-sided test.
Find the appropriate statistic to use.
Calculate the statistic.
Evaluate the calculation and accept/reject the null hypothesis.
There is always a chance that you reject a true null hypothesis (a Type I error) or accept a false one (a Type II error).
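A minimal worked example following these steps, assuming scipy; the sample and the null value $\mu_0=5$ are synthetic:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
alpha = 0.05                       # significance level, stated a priori
x = rng.normal(5.4, 2.0, size=25)  # synthetic sample
# H0: mu = 5.0, H1: mu != 5.0 (two-sided); statistic: Student's t
t, p = stats.ttest_1samp(x, popmean=5.0)
print(f"t = {t:.2f}, p = {p:.3f}, reject H0: {p < alpha}")
```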