---
title: "Statistics 2: Computer Practical 2"
output:
  pdf_document:
    number_sections: yes
    fig_height: 4
  html_document:
    number_sections: yes
urlcolor: blue
---
knitr::opts_chunk$set(echo = TRUE)
set.seed(12345)
Statistics 2 mainly focuses on the scenario where the data is modelled as a vector of observations of i.i.d. random variables, as this scenario is sufficiently rich mathematically to capture many fundamental statistical ideas.
In practice, Statistics is often used to understand statistical relationships within data, for example the relationship between some recorded variables and a response.
We consider in this practical a specific variant of this where each response is modelled as a Bernoulli random variable with a mean that depends on the other recorded variables in a simple way. The Bernoulli random variables in the model are independent but not identically distributed. This is an extension of what was considered in CP1, where the only recorded variable was whether the clinic had instituted the policy of chlorine hand-washing.
We consider the data, but not the analysis, of
Leighton SP, Krishnadas R, Chung K, Blair A, Brown S, Clark S, et al. (2019) Predicting one-year outcome in first episode psychosis using machine learning. PLoS ONE 14(3): e0212846.
Specifically, we are interested in understanding how certain factors measured at initial presentation statistically affect whether a person is in employment, education or training (EET) one year after first episode psychosis (FEP). Motivation given by Leighton et al. (2019) includes the potential to improve vocational interventions.
If you are interested in further context, you can read the paper.
The data provided by the authors of the paper is richer than what we consider here. Specifically, we will consider the records of 130 people with the following measurements:
M0_Emp
: Whether the person is in EET at initial presentation.

Female
: Whether the person is female or male (the only recorded values).

Parent
: Whether the person is a parent.

Age
: Age in years.

PANSS_G
: Total score on the PANSS General Psychopathology scale.

PANSS_P
: Total score on the PANSS Positive scale.

PANSS_N
: Total score on the PANSS Negative scale.

Y1_Emp
: Whether the person is in EET after one year.

All binary variables have 0 corresponding to “No” and 1 corresponding to “Yes”.
We will be interested in trying to understand (statistically) how the first 7 variables affect the 8th variable.
You can read about the Positive and Negative Syndrome Scale if you are interested.
Consider the model
$$Y_i \overset{\text{ind}}{\sim} {\rm Bernoulli}(\sigma(\theta^T x_i)), \qquad i \in \{1,\ldots,n\},$$
where $x_1,\ldots,x_n$ are $d$-dimensional real vectors of explanatory variables, and $\sigma$ is the standard logistic function
$$\sigma(z) = \frac{1}{1+\exp(-z)}.$$
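As a brief numerical illustration (the function below is a throwaway local definition; the same function is defined as sigma later in the practical), $\sigma$ maps any real number into $(0,1)$, with $\sigma(0) = 1/2$:

sigma.demo <- function(z) 1/(1 + exp(-z))  # the standard logistic function
sigma.demo(c(-5, -1, 0, 1, 5))             # values strictly between 0 and 1; sigma.demo(0) is 0.5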
The data consists of observed realizations of $(Y_1,\ldots,Y_n)$, namely ${\bf y} = (y_1,\ldots,y_n)$, as well as $(x_1,\ldots,x_n)$. We define the $n \times d$ matrix $X = (x_{ij})$.
You should be clear that the $x_i$ are not modelled as observations of random variables. They are deterministic quantities that are observed. Only $Y_1,\ldots,Y_n$ are random variables.
Question 1. [1 mark] Show that the log-likelihood function is
$$\ell(\theta ; {\bf y}) = \sum_{i=1}^n y_i \log (\sigma(\theta^Tx_i)) + (1-y_i) \log (1 - \sigma(\theta^Tx_i)).$$
Question 2. [1 mark] Show that each component of the score is
\begin{equation}
\label{eq:scorecomponent}
\frac{\partial \ell (\theta ; {\bf y})}{\partial \theta_j} = \sum_{i=1}^n [y_i - \sigma(\theta^T x_i)] x_{ij},\qquad j \in \{1,\ldots,d\}.
\end{equation}
I suggest that you use the fact that $\sigma'(z) = \sigma(z)[1-\sigma(z)]$.
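This identity can also be checked numerically (this is only a sanity check, not a substitute for the algebra the question asks for); the left-hand side below is a central finite-difference approximation to $\sigma'(z)$:

sigma.demo <- function(z) 1/(1 + exp(-z))              # logistic function (local definition for this check)
z <- 0.7
eps <- 1e-6
(sigma.demo(z + eps) - sigma.demo(z - eps)) / (2*eps)  # numerical derivative of sigma at z
sigma.demo(z) * (1 - sigma.demo(z))                    # sigma(z)[1 - sigma(z)]; should be nearly identical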
In fact, \autoref{eq:scorecomponent} implies that the score can be written as
$$\nabla \ell(\theta; {\bf y}) = X^T [{\bf y} - {\bf p}(\theta)],$$
where ${\bf p}(\theta)$ is the vector $(p_1(\theta),\ldots,p_n(\theta))$ with $p_i(\theta) = \sigma(\theta^T x_i)$.
Question 3. [1 mark] Show that the Hessian matrix entries are
\begin{equation}
\label{eq:hessiancomponents}
\frac{\partial^2 \ell (\theta ; {\bf y})}{\partial \theta_j \partial \theta_k} = -\sum_{i=1}^n \sigma(\theta^T x_i)[1-\sigma(\theta^T x_i)] x_{ij} x_{ik},\qquad j,k \in \{1,\ldots,d\}.
\end{equation}
In fact, \autoref{eq:hessiancomponents} implies that the Hessian can be written as
$$\nabla^2 \ell(\theta; {\bf y}) = - X^T D(\theta) X,$$
where $D(\theta)$ is the $n \times n$ diagonal matrix whose $i$th diagonal entry is $p_i(\theta)[1-p_i(\theta)]$, for $i \in \{1,\ldots,n\}$.
Notice that for this particular model, the Hessian $\nabla^2 \ell(\theta; {\bf y})$ does not depend on ${\bf y}$.
It is not difficult to see that the Hessian is negative definite for every $\theta \in \mathbb{R}^d$, provided that $X$ has full column rank. Indeed, let $z \in \mathbb{R}^d$ such that $z \neq 0$. Then $Xz \neq 0$ and
$$z^T \nabla^2 \ell(\theta; {\bf y}) z = - (Xz)^T D(\theta) (Xz) < 0,$$
since $D(\theta)$ is a diagonal matrix with positive diagonal entries and is therefore positive definite.
The following functions return the log-likelihood, score vector and Hessian matrix associated with a given $\theta$, $X$ and ${\bf y}$. You do not need to understand how exactly they compute these quantities in order to complete the practical, but you should understand what they compute.
sigma <- function(v) {
  # standard logistic function, applied componentwise
  1/(1+exp(-v))
}
ell <- function(theta, X, y) {
  # log-likelihood ell(theta; y) for the model above
  p <- as.vector(sigma(X%*%theta))
  sum(y*log(p) + (1-y)*log(1-p))
}
score <- function(theta, X, y) {
  # score vector X^T (y - p(theta))
  p <- as.vector(sigma(X%*%theta))
  as.vector(t(X)%*%(y-p))
}
hessian <- function(theta, X) {
  # Hessian matrix -X^T D(theta) X; note that it does not depend on y
  p <- as.vector(sigma(X%*%theta))
  -t(X)%*%((p*(1-p))*X)
}
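As an optional sanity check (the small X.toy, y.toy and theta.toy below are made up for illustration and are unrelated to the FEP-EET data), the score function should agree with a finite-difference approximation of the gradient of ell, and the eigenvalues of the Hessian should all be negative, consistent with the negative definiteness argument above:

# a tiny made-up example: 4 observations, d = 3 (intercept plus two explanatory variables)
X.toy <- cbind(1, c(-1, 0, 1, 2), c(2, 1, 0, 3))
y.toy <- c(0, 1, 1, 0)
theta.toy <- c(0.5, -1, 2)

# central finite-difference approximation to each component of the score
eps <- 1e-6
fd.grad <- sapply(1:3, function(j) {
  e <- rep(0, 3); e[j] <- eps
  (ell(theta.toy + e, X.toy, y.toy) - ell(theta.toy - e, X.toy, y.toy)) / (2*eps)
})
max(abs(fd.grad - score(theta.toy, X.toy, y.toy)))  # should be very close to 0

# all eigenvalues of the Hessian should be negative
eigen(hessian(theta.toy, X.toy))$values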
First we load the FEP-EET data, and inspect the first few rows of data.
fep.eet <- read.csv("FEP_EET.csv")
head(fep.eet)
In order to calculate the ML estimate, we should extract the matrix $X$ of explanatory variables and the vector ${\bf y}$ of responses.
X.raw <- as.matrix(fep.eet[,1:7])
y <- fep.eet$Y1_Emp
It is useful to add a column of $1$s to $X$, so that there is an “intercept” term in the model. This is what we did in CP1 with the mortality data, so that we could distinguish between a “baseline” probability and the impact of chlorine hand-washing. Mathematically, the value of $\theta_1$ determines the probability when the explanatory variables are all 0.\footnote{Even if this is impossible: in this dataset the youngest Age is 15 and the minimum possible score for the PANSS scale overall is 30.}
X <- cbind(1, X.raw)
head(X)
d <- 8
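As a quick check (the expected dimensions follow from the description above: 130 people and $d = 8$ columns including the intercept):

dim(X)            # expect 130 rows and 8 columns
all(X[, 1] == 1)  # the added intercept column should be all 1s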
We are now in a position to compute the maximum likelihood estimate. The following code computes the ML estimate using R’s general-purpose optim
function. Unlike some simpler numerical optimization problems, this is already challenging enough that one must choose the options somewhat carefully.
maximize.ell <- function(ell, score, X, y, theta0) {
  # maximize the log-likelihood using BFGS; fnscale=-1 makes optim maximize rather than minimize
  optim.out <- optim(theta0, fn=ell, gr=score, X=X, y=y, method="BFGS",
                     control=list(fnscale=-1, maxit=1000, reltol=1e-16))
  optim.out$par
}
mle <- maximize.ell(ell, score, X, y, rep(0,d))
mle
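Since the log-likelihood is concave, the ML estimate should be (approximately) a stationary point; an informal check on the optimization is that the score evaluated at the returned estimate is numerically close to the zero vector:

score(mle, X, y)  # each component should be very small in absolute value if optim has converged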
It appears that being in EET at month 0 greatly improves the probability of being in EET after 1 year. Being female appears to lower the probability, while being a parent increases it. Age seems to have a small negative effect. With respect to the PANSS scores, higher general scores appear to improve the probability, while higher positive and negative scores seem to decrease it. Naturally, we don’t necessarily have much confidence in any of these statements: the true parameter values could be 0 or even have different signs to the estimated values.
Although the parameter values for age and the PANSS scores are quite small, it’s worth bearing in mind that these parameters multiply larger integers, e.g. ages are between 15 and 45 in the data we are analyzing.
So far all we have done is find the maximizer of the log-likelihood function, i.e. the ML estimate of $\theta$. What you will have to do now is produce observed “Wald”-type confidence intervals for each of the components of $\theta$.
In lectures we have seen that for regular statistical models with a one-dimensional parameter $\theta$, the ML estimator $\hat{\theta}_n$ is asymptotically normal with
$$I_n(\hat{\theta}_n)^{1/2}(\hat{\theta}_n - \theta) = \sqrt{n I(\hat{\theta}_n)}(\hat{\theta}_n - \theta) \to_{\mathcal{D}(\cdot;\theta)} Z \sim N(0,1).$$
This convergence in distribution justifies the construction of Wald confidence intervals for $\theta$.
In this computer practical, the statistical model has a $d$-dimensional parameter and the observed random variables are independent but not identically distributed. Nevertheless, for this model, under some appropriate regularity assumptions on $x_1,x_2,\ldots$, the ML estimator $\hat{\theta}_n$ is asymptotically (multivariate) normal in the sense that
$$I_n(\theta)^{1/2}(\hat{\theta}_{n}-\theta)$$
converges in distribution to a vector of $d$ independent standard normal random variables, where $I_n(\theta)$ is the Fisher information matrix
$$I_n(\theta) = - \mathbb{E}[\nabla^2 \ell(\theta; Y_1,\ldots,Y_n) ; \theta].$$
In our case, the Fisher information matrix is precisely the negative Hessian of the log-likelihood, because the Hessian of the log-likelihood does not depend on ${\bf y}$.
One can deduce from this multivariate asymptotic normality that for $j \in \{1,\ldots,d\}$,
$$\frac{\hat{\theta}_{n,j}-\theta_{j}}{\sqrt{(I_n(\theta)^{-1})_{jj}}} \to_{\mathcal{D}(\cdot;\theta)} Z \sim N(0,1),$$
where $\hat{\theta}_{n,j}$ denotes the $j$th component of $\hat{\theta}_n$ and $\theta_j$ denotes the $j$th component of $\theta$.
Notice that $(I_n(\theta)^{-1})_{jj}$ is the $j$th diagonal entry of the inverse of the Fisher information matrix, and is not in general equal to $(I_n(\theta)_{jj})^{-1}$, the inverse of the $j$th diagonal entry of the Fisher information matrix.\footnote{This is statistically interesting, as it captures the fact that our estimators are less precise in the presence of more parameters.} In R you can compute numerically the inverse of a matrix using the solve command.
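For example (using a made-up $2 \times 2$ matrix, unrelated to the Fisher information for this model), the diagonal of the inverse is not the elementwise reciprocal of the diagonal:

M <- matrix(c(2, 1, 1, 2), nrow=2)  # a made-up symmetric positive definite matrix
solve(M)        # the inverse of M
diag(solve(M))  # diagonal entries of the inverse: 2/3, 2/3
1/diag(M)       # reciprocals of the diagonal entries: 1/2, 1/2 -- different in general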
As in the one-dimensional parameter case, when the function $\theta \mapsto (I_n(\theta)^{-1})_{jj}$ is continuous, one can replace $(I_n(\theta)^{-1})_{jj}$ with $(I_n(\hat{\theta}_n)^{-1})_{jj}$ to obtain
$$\frac{\hat{\theta}_{n,j}-\theta_{j}}{\sqrt{(I_n(\hat{\theta}_{n})^{-1})_{jj}}}\to_{\mathcal{D}(\cdot;\theta)}Z\sim N(0,1).$$
Question 4. [3 marks] Compute the lower and upper endpoints of observed asymptotically exact $1-\alpha$ “Wald” confidence intervals for each component of $\theta$, for $\alpha=0.05$. I recommend that you write a function, so you can use it in later questions.
compute.CI.endpoints <- function(X, y, alpha) {
mle <- maximize.ell(ell, score, X, y, rep(0,d))
# some code here
# compute the lower and upper endpoints
# lower and upper should be vectors whose length is the same length as mle
lower <- mle - 1 # obviously wrong
upper <- mle + 1 # obviously wrong
return(list(lower=lower,upper=upper))
}
ci <- compute.CI.endpoints(X, y, 0.05)
If you have written the compute.CI.endpoints
function above, the following code will visualize the observed confidence intervals in two separate plots.
ci <- compute.CI.endpoints(X, y, 0.05)
plot.ci <- function(mle, CI.L, CI.U, components) {
plot(components, mle[components], pch=20, main="Observed confidence intervals",
xlab="component", ylab="value", ylim=c(min(CI.L[components]), max(CI.U[components])))
arrows(components, CI.L[components], components, CI.U[components], length=0.05, angle=90, code=3)
abline(h=0.0, col="red")
axis(side=1, at=components, labels = FALSE)
}
plot.ci(mle, ci$lower, ci$upper, 1:4)
plot.ci(mle, ci$lower, ci$upper, 5:8)
Now we will perform repeated experiments under the assumption that $\theta^* = (-0.7,3.5,0,0,0,0,-0.08,0)$. Notice that if the true value of a component of $\theta$ is exactly $0$, this means that the corresponding explanatory variable has no effect on the probability of the response. For this reason, it is often interesting to look at whether an observed confidence interval includes or excludes the value $0$.
In order to perform repeated experiments, one needs to be able to simulate data according to the model. The following function should be helpful.
# generate data associated with the matrix X when theta is the true value of the parameter
generate.ys <- function(X, theta) {
n <- dim(X)[1]
rbinom(n, size = 1, prob=sigma(X%*%theta))
}
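As a minimal usage sketch, with $\theta^*$ as given above, a single synthetic response vector for the original $X$ matrix can be generated as follows (each call produces a new random dataset):

theta.star <- c(-0.7, 3.5, 0, 0, 0, 0, -0.08, 0)  # the assumed true parameter value from the text
y.sim <- generate.ys(X, theta.star)
head(y.sim)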
Question 5. [2 marks] Under the assumption that $\theta^* = (-0.7,3.5,0,0,0,0,-0.08,0)$, letting $\alpha=0.05$ and using the same $X$ matrix, approximate:
I suggest you write functions using the following template.
# returns the proportion of trials in which the condition holds
prop.condition <- function(trials, X, theta, component) {
count <- 0
for (i in 1:trials) {
# simulate synthetic data using X and theta
# compute observed confidence interval(s)
# if condition is satisfied, increment count
# e.g. if theta[component] is in the relevant observed confidence interval,
# or if the relevant observed confidence interval excludes 0
}
count/trials
}
Now for the purpose of performing repeated experiments, we will assume that we have twice as much data, in the sense that there are two Bernoulli observations for each row of the $X$ matrix. One can equivalently construct a larger $X$ matrix with two copies of each row in the original $X$ matrix, as follows:
big.X <- rbind(X, X)
Question 6. [2 marks] Under the assumption that $\theta^* = (-0.7,3.5,0,0,0,0,-0.08,0)$, letting $\alpha=0.05$ and now using the big.X matrix, approximate:

Explain any differences from the results of the previous question.
Analyzing this data using this model does suggest a possible statistical relationship between a higher positive scale PANSS score and reduced probability of being in EET after one year. You may notice that the statistical evidence is much weaker than in CP1, where we looked at the relationship between chlorine hand-washing and mortality after childbirth.
It is not possible to give a causal interpretation to this statistical relationship. For example, it is not clear from the data and the analysis how or why higher positive scale scores have any effect on EET: this could be because such scores are associated with characteristics that hinder the ability to engage in EET, because those characteristics are more highly stigmatized, or for various other possible reasons.
The study motivates understanding the statistical relationship between PANSS scores and EET as a route to improving vocational interventions. You may be interested to know that the effect of such interventions, e.g. Individual Placement and Support, is often also analyzed from a statistical perspective.
Finally, you may be happy to learn that some of what we have done in this computer practical can be done with less effort using standard R functionality. For example, for this dataset one could produce the ML estimate and particular observed confidence intervals via three simple commands. However, using statistical software without understanding what it is doing can easily lead to serious errors in decision-making.
model <- glm(Y1_Emp ~ ., family=binomial, data=fep.eet)
summary(model)
confint(model, level=0.95)
Note that the observed confidence intervals produced are slightly different to observed Wald confidence intervals, so you will not get any marks if you report these as your answers for Q4. The point of this practical is for you to learn how such quantities can be calculated.
If you are interested in linear and generalized linear models, like this one, there is a third year unit you can take. These models are very widely used.