
# STAD37 Final Exam Dec. 13, 2019 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences


Name:

Student number:


STAD37: Multivariate Analysis

Duration: 3 hours

Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.

Aids allowed:

- A two-page, double-sided formula sheet (the formula sheet cannot be larger than 8.5 × 11 inches)

- A non-programmable and non-communicating calculator

Good luck!!

| Question | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|----------|----|----|----|----|----|----|-------|
| Points | 10 | 14 | 14 | 31 | 14 | 17 | 100 |
| Score | | | | | | | |

1. True/False questions (no need to give reasons)

(a) (2 points) If $e$ is an eigenvector of $A$ with eigenvalue $\lambda_A$, and $e$ is also an eigenvector of $B$ with eigenvalue $\lambda_B$, then $e$ is an eigenvector of $A - B$, and the corresponding eigenvalue is $\lambda_A - \lambda_B$.

(a) ________

(b) (2 points) Suppose $x_1, \dots, x_n$ are independent observations from any population with mean $\mu$ and nonsingular covariance $\Sigma$. Let $\bar{x}$ and $S$ be the sample mean and sample covariance matrix. Then $(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu)$ approximately follows a $\chi^2$ distribution with $p$ degrees of freedom for $n - p$ large, where $p$ is the dimension of the vector $\mu$.

(b) ________

(c) (2 points) The confidence region for the mean vector $\mu$ of a $p$-dimensional random vector $x$ has the shape of a $p$-dimensional ellipsoid.

(c) ________

(d) (2 points) The residual matrix, factor scores, communalities and specific variances are all unchanged under orthogonal factor rotations.

(d) ________

(e) (2 points) In the classification problem, the apparent error rate can be calculated from the confusion matrix. Moreover, the apparent error rate tends to overestimate the actual error rate even if the sample sizes are large.

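(e) ________

2. Assume $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$ follows the normal distribution $N_3(\mu, \Sigma)$ with

$$\mu = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}.$$

(a) (4 points) Find the distribution of $3x_1 + x_2 - 2x_3$.

(b) (4 points) Assume that $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$ is a $2 \times 1$ vector. In order to make $x_3$ and $x_3 - c' \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ independent, what condition does $c$ need to satisfy?

(c) (6 points) Find the conditional distribution of $\begin{bmatrix} x_1 - x_2 \\ x_2 - x_3 \end{bmatrix}$ given the value of $x_2$.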
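As a numerical check for part (a): a linear combination $a'x$ of a multivariate normal vector is univariate normal with mean $a'\mu$ and variance $a'\Sigma a$. The R sketch below evaluates both quantities. It assumes the mean vector of Question 2 reads $\mu = (2, 1, 3)'$ and that $\Sigma$ has rows $(1, 1, 1)$, $(1, 4, 2)$, $(1, 2, 3)$; the object names `a`, `lin_mean` and `lin_var` are illustrative and not part of the exam.

```r
# Mean vector and covariance matrix as read from Question 2 (assumed values).
mu    <- c(2, 1, 3)
Sigma <- matrix(c(1, 1, 1,
                  1, 4, 2,
                  1, 2, 3), nrow = 3, byrow = TRUE)

# Coefficients of the linear combination 3*x1 + x2 - 2*x3.
a <- c(3, 1, -2)

lin_mean <- sum(a * mu)                       # a' mu
lin_var  <- as.numeric(t(a) %*% Sigma %*% a)  # a' Sigma a

cat("mean =", lin_mean, " variance =", lin_var, "\n")
```

The same pattern (a coefficient vector applied to `mu` and `Sigma`) extends to several linear combinations at once by replacing `a` with a coefficient matrix.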

3. Assume a symmetric $3 \times 3$ matrix, $A$, has the following normalized eigenvectors and corresponding eigenvalues:

$$e_1 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},\ \lambda_1 = 16, \qquad e_2 = \begin{bmatrix} 3/5 \\ 4/5 \\ 0 \end{bmatrix},\ \lambda_2 = 100, \qquad e_3 = \begin{bmatrix} -4/5 \\ 3/5 \\ 0 \end{bmatrix},\ \lambda_3 = 25.$$

(a) (3 points) Write the original matrix $A$.

(b) (2 points) Calculate the determinant of $A$.

(c) (3 points) Calculate $A^{-1/2}$.

(d) (6 points) Prove that if a $p \times p$ symmetric matrix $B$ is positive definite, then $B^{-1}$ is also positive definite. (You may assume without proof that $B^{-1}$ exists. Your proof should be based directly on the definition of positive definiteness, and may not use any theorems that mention positive definiteness.)

4. Suppose that there are two independent random samples, which are from population $\pi_1$ and population $\pi_2$, respectively. The sample sizes of the two random samples are $n_1 = 41$ and $n_2 = 61$. Assume that population $\pi_1$ follows the bivariate normal distribution $N_2(\mu_1, \Sigma_1)$ and population $\pi_2$ follows the bivariate normal distribution $N_2(\mu_2, \Sigma_2)$. The sample means and sample covariance matrices are given by

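$$\bar{x}_1 = \begin{bmatrix} 2 \\ 6 \end{bmatrix}, \quad S_1 = \begin{bmatrix} 2 & 1 \\ 1 & 5 \end{bmatrix}, \quad \bar{x}_2 = \begin{bmatrix} 1 \\ 8 \end{bmatrix}, \quad S_2 = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}.$$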

For parts (a)-(e) below, we assume that $\Sigma_1 = \Sigma_2 = \Sigma$.
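Under this equal-covariance assumption, the two sample covariance matrices are usually combined into a pooled estimate before the two-sample Hotelling's $T^2$ is computed. A minimal R sketch follows. It assumes the summary statistics of Question 4 read $\bar{x}_1 = (2, 6)'$, $\bar{x}_2 = (1, 8)'$, with $S_1$ having rows $(2, 1)$, $(1, 5)$ and $S_2$ having rows $(4, 1)$, $(1, 3)$. It uses `qf()` for the critical value rather than printed tables, and all object names are illustrative.

```r
# Summary statistics as read from Question 4 (assumed values).
n1 <- 41; n2 <- 61; p <- 2
xbar1 <- c(2, 6); S1 <- matrix(c(2, 1, 1, 5), nrow = 2)
xbar2 <- c(1, 8); S2 <- matrix(c(4, 1, 1, 3), nrow = 2)

# Pooled covariance estimate under Sigma1 = Sigma2.
S_pooled <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Two-sample Hotelling's T^2 for H0: mu1 = mu2.
d  <- xbar1 - xbar2
T2 <- as.numeric(t(d) %*% solve((1 / n1 + 1 / n2) * S_pooled) %*% d)

# Critical value: scaled F quantile with (p, n1 + n2 - p - 1) degrees of freedom.
crit <- (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * qf(0.95, p, n1 + n2 - p - 1)

cat("T2 =", T2, " critical value =", crit, " reject H0:", T2 > crit, "\n")
```

On an exam the F quantile would come from the tables; `qf()` is used here only so the sketch is self-contained.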

(a) (5 points) Test $H_0: \mu_1 = \mu_2$ using Hotelling's $T^2$ at the significance level $\alpha = 0.05$. (Hint: If you cannot find the exact critical value in the tables, use the closest value instead.)

(b) (4 points) Construct the 95% simultaneous $T^2$ confidence intervals for $\mu_{11} - \mu_{21}$ and $\mu_{12} - \mu_{22}$. Note that $\mu_1 = \begin{bmatrix} \mu_{11} \\ \mu_{12} \end{bmatrix}$ and $\mu_2 = \begin{bmatrix} \mu_{21} \\ \mu_{22} \end{bmatrix}$.

(c) (6 points) Construct the estimated minimum ECM classification rule.

(d) (3 points) Assume equal prior probabilities, $c(1|2) = 20$ and $c(2|1) = 100$. Allocate the observation $x_0 = \begin{bmatrix} 3 \\ 7 \end{bmatrix}$ to population $\pi_1$ or population $\pi_2$ based on the classification rule in (c).

(e) (5 points) Assume equal prior probabilities. Allocate the observation $x_0 = \begin{bmatrix} 3 \\ 7 \end{bmatrix}$ to population $\pi_1$ or population $\pi_2$ using the estimated Bayes classifier.

(f) (8 points) Test $H_0: \Sigma_1 = \Sigma_2$ using Box's test (Bartlett's test) at the significance level $\alpha = 0.05$. According to the test result, does the classification rule obtained in (c) need to be adjusted?

5. Observations on two responses are collected for three treatments. The observation vectors $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ are

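Treatment 1: $\begin{bmatrix} 5 \\ 6 \end{bmatrix}, \begin{bmatrix} 10 \\ 8 \end{bmatrix}, \begin{bmatrix} 5 \\ 4 \end{bmatrix}, \begin{bmatrix} 8 \\ 6 \end{bmatrix}$,

Treatment 2: $\begin{bmatrix} 4 \\ 4 \end{bmatrix}, \begin{bmatrix} 5 \\ 3 \end{bmatrix}, \begin{bmatrix} 3 \\ 2 \end{bmatrix}$,

Treatment 3: $\begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 5 \\ 2 \end{bmatrix}, \begin{bmatrix} 6 \\ 4 \end{bmatrix}, \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \begin{bmatrix} 4 \\ 5 \end{bmatrix}$.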
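For data laid out this way, Wilks' Lambda is the ratio $|W| / |B + W|$, where $W$ is the within-treatment sum-of-squares-and-cross-products (SSCP) matrix and $B + W$ is the total SSCP matrix about the grand mean. The R sketch below is one way to check a hand computation. It assumes the twelve observation vectors are $(5,6)$, $(10,8)$, $(5,4)$, $(8,6)$ for Treatment 1, $(4,4)$, $(5,3)$, $(3,2)$ for Treatment 2, and $(2,1)$, $(5,2)$, $(6,4)$, $(3,3)$, $(4,5)$ for Treatment 3; the helper `sscp` and the objects `W`, `Tot` and `wilks` are illustrative names.

```r
# Responses for the 12 observations, ordered by treatment (assumed values).
x1  <- c(5, 10, 5, 8,   4, 5, 3,   2, 5, 6, 3, 4)
x2  <- c(6,  8, 4, 6,   4, 3, 2,   1, 2, 4, 3, 5)
trt <- factor(rep(1:3, times = c(4, 3, 5)))
X   <- cbind(x1, x2)

# SSCP matrix of the rows of M about their column means.
sscp <- function(M) crossprod(sweep(M, 2, colMeans(M)))

# Within-treatment SSCP: add up the SSCP matrices of the three groups.
W <- Reduce(`+`, lapply(levels(trt),
                        function(g) sscp(X[trt == g, , drop = FALSE])))

# Total SSCP about the grand mean, which equals B + W.
Tot <- sscp(X)

wilks <- det(W) / det(Tot)
cat("Wilks' Lambda =", wilks, "\n")
```

Computing `Tot` directly avoids forming the between-treatment matrix $B$ explicitly; `Tot - W` recovers it if needed.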

(a) (10 points) Compute Wilks' Lambda $\Lambda^*$ for testing the treatment effects.

(b) (4 points) Use the test statistic based on Wilks' Lambda to test for treatment effects with $\alpha = 0.05$. (Hint: The test statistic is

$$\left( \frac{N - g - 1}{g - 1} \right) \left( \frac{1 - \sqrt{\Lambda^*}}{\sqrt{\Lambda^*}} \right)$$

with $N = \sum_{l=1}^{g} n_l$ and $g$ being the number of groups; the test statistic follows the distribution $F_{2(g-1),\, 2(N-g-1)}$ under $H_0$.)

6. Consider the data on the national track records for men. The variables are: country, 100m (s), 200m (s), 400m (s), 800m (min), 1500m (min), 5000m (min), 10,000m (min) and Marathon (min). There are 54 countries. Part of the data is as follows:

```
> head(Track, 30)
          Country  100m 200m 400m 800m 1500m 5000m 10,000m Marathon
1       Argentina 10.23 20.4 46.2 1.77  3.68  13.3    27.6      130
2       Australia  9.93 20.1 44.4 1.74  3.53  12.9    27.5      128
3         Austria 10.15 20.4 45.8 1.77  3.58  13.3    27.7      132
4         Belgium 10.14 20.2 45.0 1.73  3.57  12.8    26.9      127
5         Bermuda 10.27 20.3 45.3 1.79  3.70  14.6    30.5      146
6          Brazil 10.00 19.9 44.3 1.70  3.57  13.5    28.1      126
7          Canada  9.84 20.2 44.7 1.75  3.53  13.2    27.6      130
8           Chile 10.10 20.1 45.9 1.76  3.65  13.4    28.1      132
9           China 10.17 20.4 45.2 1.77  3.61  13.4    28.2      129
10       Columbia 10.29 20.9 45.8 1.80  3.72  13.5    27.9      131
11    CookIslands 10.97 22.5 51.4 1.94  4.24  16.7    35.4      171
12      CostaRica 10.32 21.0 46.4 1.87  3.84  13.8    28.8      133
13  CzechRepublic 10.24 20.6 45.8 1.75  3.58  13.4    27.8      132
14        Denmark 10.29 20.5 45.9 1.69  3.52  13.4    27.9      129
15 DominicanRepub 10.16 20.6 44.9 1.81  3.73  14.3    30.4      146
16        Finland 10.21 20.5 45.5 1.74  3.61  13.3    27.5      131
17         France 10.02 20.2 44.6 1.72  3.48  13.0    27.4      126
18        Germany 10.06 20.2 44.3 1.73  3.53  12.9    27.4      128
19   GreatBritain  9.87 19.9 44.4 1.70  3.49  13.0    27.3      127
20         Greece 10.11 19.9 45.6 1.75  3.61  13.5    28.1      132
21      Guatemala 10.32 21.1 48.4 1.82  3.74  14.0    29.3      133
22        Hungary 10.08 20.1 45.4 1.76  3.59  13.4    28.0      132
23          India 10.33 20.7 45.5 1.76  3.63  13.5    28.8      132
24      Indonesia 10.20 20.9 46.4 1.83  3.77  14.2    29.6      139
25        Ireland 10.35 20.5 45.6 1.75  3.56  13.1    27.8      129
26         Israel 10.20 20.9 46.6 1.80  3.70  13.7    28.7      134
27          Italy 10.01 19.7 45.3 1.73  3.55  13.1    27.3      127
28          Japan 10.00 20.0 44.8 1.77  3.62  13.2    27.6      126
29          Kenya 10.28 20.4 44.2 1.70  3.44  12.7    26.5      125
30    Korea,South 10.34 20.4 45.4 1.74  3.64  13.8    28.5      127
```

Answer the following questions according to the R output given after the questions.

(a) (2 points) How many principal components should be retained if the proportion of total standardized sample variance explained by the principal components should be at least 90% but not larger than 94%?

(b) (4 points) Provide interpretations for the first and second principal components.

(c) (4 points) The factor analysis using the maximum likelihood method is given in the output. The factors are rotated using the "varimax" rotation. Interpret the two rotated factors.

(d) (2 points) Based on the estimated two-factor model, calculate the communality for the variable "Marathon".

(e) (2 points) For the maximum likelihood method, are the two common factors sufficient based on the likelihood ratio test using Bartlett's correction at the level $\alpha = 0.01$? (Hint: Use the R output directly.)

(f) (3 points) Figure 1 shows the scatter plot for the factor scores. Are there any outliers in the data? Explain the reason.
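Part (a) depends on the cumulative proportion of variance, which follows from the component standard deviations reported by `summary(Track.pca)` in the R output for Question 6: each squared standard deviation is divided by the sum of all of them. The sketch below redoes that arithmetic, assuming the eight standard deviations are 2.589, 0.7990, 0.4770, 0.4537, 0.3124, 0.26587, 0.21666 and 0.09858 as printed; `sdev`, `cum_prop` and `k_ok` are illustrative names.

```r
# Component standard deviations copied from the summary(Track.pca) output.
sdev <- c(2.589, 0.7990, 0.4770, 0.4537, 0.3124, 0.26587, 0.21666, 0.09858)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_prop <- cumsum(prop_var)       # cumulative proportion of variance

# Numbers of retained components whose cumulative share lies in [0.90, 0.94].
k_ok <- which(cum_prop >= 0.90 & cum_prop <= 0.94)

print(round(cum_prop, 4))
print(k_ok)
```

Because the analysis uses standardized variables (`scale=TRUE`), the squared standard deviations sum to approximately 8, the number of variables.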

Figure 1: Scatter plot for the factor scores (horizontal axis: Factor 1; vertical axis: Factor 2; each point is labelled with its country name).

R output for Question 6:

```
> Y <- Track[ , -1]
> Track.pca = prcomp(Y, scale=TRUE)
> summary(Track.pca)
Importance of components:
                         PC1    PC2    PC3    PC4    PC5
Standard deviation     2.589 0.7990 0.4770 0.4537 0.3124
Proportion of Variance 0.838 0.0798 0.0284 0.0257 0.0122
Cumulative Proportion  0.838 0.9177 0.9462 0.9719 0.9841
                           PC6     PC7     PC8
Standard deviation     0.26587 0.21666 0.09858
Proportion of Variance 0.00884 0.00587 0.00121
Cumulative Proportion  0.99292 0.99879 1.00000
> Track.pca$rotation
            PC1     PC2      PC3     PC4     PC5     PC6
100m     -0.332 -0.5294 -0.34386 -0.3807  0.2997 -0.3620
200m     -0.346 -0.4704  0.00379 -0.2170 -0.5414  0.3486
400m     -0.339 -0.3453  0.06706  0.8513  0.1330  0.0771
800m     -0.353  0.0895  0.78271 -0.1343 -0.2273 -0.3413
1500m    -0.366  0.1537  0.24427 -0.2330  0.6516  0.5298
5000m    -0.370  0.2948 -0.18286  0.0546  0.0718 -0.3591
10,000m  -0.366  0.3336 -0.24398  0.0871 -0.0613 -0.2731
Marathon -0.354  0.3866 -0.33463 -0.0181 -0.3379  0.3752
            PC7      PC8
100m      0.348 -0.06570
200m     -0.440  0.06076
400m      0.114 -0.00347
800m      0.259 -0.03927
1500m    -0.147 -0.03975
5000m    -0.328  0.70568
10,000m  -0.351 -0.69718
Marathon  0.594  0.06932
> Track.factor <- factanal(Y, 2, cor=T, rotation="varimax",
                           scores="regression")
> Track.factor

Call:
factanal(x = Y, factors = 2, scores = "regression", rotation = "varimax",
    cor = T)

Uniquenesses:
    100m     200m     400m     800m    1500m    5000m  10,000m Marathon
   0.135    0.037    0.228    0.212    0.134    0.012    0.011    0.088

Loadings:
         Factor1 Factor2
100m     0.397   0.841
200m     0.404   0.894
400m     0.511   0.714
800m     0.667   0.585
1500m    0.745   0.558
5000m    0.883   0.455
10,000m  0.897   0.429
Marathon 0.863   0.410

               Factor1 Factor2
SS loadings      3.912   3.231
Proportion Var   0.489   0.404
Cumulative Var   0.489   0.893

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 25.9 on 13 degrees of freedom.
The p-value is 0.0173
> Track.scores <- Track.factor$scores
> Track.df <- data.frame(Y, Track.scores)
> rownames(Track.df) = Track[ ,1]
> attach(Track.df)
> plot(Factor1, Factor2, type='n', xlab='Factor 1', ylab='Factor 2')
> text(Factor1, Factor2, labels = row.names(Track.df), cex = 0.7, col="blue")
```
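In an orthogonal factor model fitted to the correlation matrix, a variable's communality is the sum of its squared loadings, and it should agree with one minus the reported uniqueness up to rounding. The sketch below checks this for "Marathon". It assumes the rotated loadings in the `factanal()` output read 0.863 (Factor1) and 0.410 (Factor2) and that the uniqueness reads 0.088; the object names are illustrative.

```r
# Rotated loadings and uniqueness for "Marathon", copied from the output above.
marathon_loadings   <- c(Factor1 = 0.863, Factor2 = 0.410)
marathon_uniqueness <- 0.088

# Communality two ways: sum of squared loadings, and 1 - uniqueness.
communality_from_loadings   <- sum(marathon_loadings^2)
communality_from_uniqueness <- 1 - marathon_uniqueness

cat("from loadings:  ", communality_from_loadings, "\n")
cat("from uniqueness:", communality_from_uniqueness, "\n")
```

The two values differ only in the third decimal place because the printed loadings and uniqueness are rounded.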