STAD37 Final Exam Dec. 13, 2019 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences

Name:

Student number:

STAD37: Multivariate Analysis Duration: 3 hours

Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.

Aids allowed:

- A two-page, double-sided formula sheet (no larger than 8.5 × 11 inches)

- A non-programmable, non-communicating calculator

There are 20 pages including this page. Please check to see you have all the pages.

Good luck!!

Question:    1    2    3    4    5    6   Total
Points:     10   14   14   31   14   17     100
Score:

1. True/False questions (no need to give reasons)

(a) (2 points) If $e$ is an eigenvector of $A$ with eigenvalue $\lambda_A$, and $e$ is also an eigenvector of $B$ with eigenvalue $\lambda_B$, then $e$ is an eigenvector of $A - B$, and the corresponding eigenvalue is $\lambda_A - \lambda_B$.

    Answer (a):

(b) (2 points) Suppose $x_1, \dots, x_n$ are independent observations from any population with mean $\mu$ and nonsingular covariance $\Sigma$. Let $\bar{x}$ and $S$ be the sample mean and sample covariance matrix. Then $(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu)$ approximately follows a $\chi^2$ distribution with $p$ degrees of freedom for $n - p$ large, where $p$ is the dimension of the vector $\mu$.

    Answer (b):

(c) (2 points) The confidence region for the mean vector $\mu$ of a $p$-dimensional random vector $x$ has the shape of a $p$-dimensional ellipsoid.

    Answer (c):

(d) (2 points) The residual matrix, factor scores, communalities and specific variances are all unchanged under orthogonal factor rotations.

    Answer (d):

(e) (2 points) In the classification problem, the apparent error rate can be calculated from the confusion matrix. Moreover, the apparent error rate tends to overestimate the actual error rate even if the sample sizes are large.

    Answer (e):

2. Assume $x = [x_1, x_2, x_3]'$ follows the normal distribution $N_3(\mu, \Sigma)$ with

$$\mu = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}.$$

(a) (4 points) Find the distribution of $3x_1 + x_2 - 2x_3$.

(b) (4 points) Assume that $c = [c_1, c_2]'$ is a $2 \times 1$ vector. In order to make $x_3$ and $x_3 - c'[x_1, x_2]'$ independent, what condition does $c$ need to satisfy?

(c) (6 points) Find the conditional distribution of $[x_1 - x_2, \; x_2 - x_3]'$ given the value of $x_2$.
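For checking an answer to Question 2(a) numerically: a linear combination $a'x$ of a multivariate normal vector is univariate normal with mean $a'\mu$ and variance $a'\Sigma a$. The R sketch below evaluates both quantities. The entries of `mu` and `Sigma` are the values as read from the question ($\mu = [2, 1, 3]'$ and the symmetric $\Sigma$ with rows 1 1 1, 1 4 2, 1 2 3); treat them as assumptions if your copy of the exam differs. The variable names are illustrative only.

```r
# Parameters of x ~ N_3(mu, Sigma), as read from Question 2 (assumed values)
mu    <- c(2, 1, 3)
Sigma <- matrix(c(1, 1, 1,
                  1, 4, 2,
                  1, 2, 3), nrow = 3, byrow = TRUE)

# Coefficients of the linear combination 3*x1 + x2 - 2*x3
a <- c(3, 1, -2)

# a'x is normal with mean a'mu and variance a' Sigma a
lin_mean <- sum(a * mu)
lin_var  <- drop(t(a) %*% Sigma %*% a)

cat("mean:", lin_mean, " variance:", lin_var, "\n")
```

The same two lines work for any coefficient vector `a`, which makes this a convenient way to verify hand calculations for other linear combinations.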

3. Assume a symmetric $3 \times 3$ matrix, $A$, has the following normalized eigenvectors and corresponding eigenvalues:

$$e_1 = [0, 0, 1]', \quad \lambda_1 = 16; \qquad e_2 = [\tfrac{3}{5}, \tfrac{4}{5}, 0]', \quad \lambda_2 = 100; \qquad e_3 = [-\tfrac{4}{5}, \tfrac{3}{5}, 0]', \quad \lambda_3 = 25.$$

(a) (3 points) Write the original matrix $A$.

(b) (2 points) Calculate the determinant of $A$.

(c) (3 points) Calculate $A^{-1/2}$.
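A hand solution to Question 3(a)-(c) can be checked against the spectral decomposition $A = \sum_i \lambda_i e_i e_i'$, from which $\det A = \prod_i \lambda_i$ and $A^{-1/2} = \sum_i \lambda_i^{-1/2} e_i e_i'$. The R sketch below is one way to do that. It assumes the eigenpairs are $e_1 = [0, 0, 1]'$ with $\lambda_1 = 16$, $e_2 = [3/5, 4/5, 0]'$ with $\lambda_2 = 100$, and $e_3 = [-4/5, 3/5, 0]'$ with $\lambda_3 = 25$; the sign of $e_3$ is an assumption and does not affect the results.

```r
# Eigenvectors as columns and matching eigenvalues (values assumed as stated above)
E      <- cbind(c(0, 0, 1), c(3/5, 4/5, 0), c(-4/5, 3/5, 0))
lambda <- c(16, 100, 25)

# Spectral decomposition: A = E diag(lambda) E'
A <- E %*% diag(lambda) %*% t(E)

# The determinant is the product of the eigenvalues
det_A <- prod(lambda)

# Inverse square root: same eigenvectors, eigenvalues replaced by lambda^(-1/2)
A_inv_sqrt <- E %*% diag(1 / sqrt(lambda)) %*% t(E)

print(A)
print(det_A)
print(A_inv_sqrt)
```

A useful sanity check is that `A_inv_sqrt %*% A_inv_sqrt %*% A` returns the identity matrix up to rounding.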

(d) (6 points) Prove that if a $p \times p$ symmetric matrix $B$ is positive definite, then $B^{-1}$ is also positive definite. (You may assume without proof that $B^{-1}$ exists. Your proof should be based directly on the definition of positive definiteness, and may not use any theorems that mention positive definiteness.)

4. Suppose that there are two independent random samples, which are from population $\pi_1$ and population $\pi_2$, respectively. The sample sizes of the two random samples are $n_1 = 41$ and $n_2 = 61$. Assume that population $\pi_1$ follows the bivariate normal distribution $N_2(\mu_1, \Sigma_1)$ and population $\pi_2$ follows the bivariate normal distribution $N_2(\mu_2, \Sigma_2)$. The sample means and sample covariance matrices are given by

$$\bar{x}_1 = \begin{bmatrix} 2 \\ 6 \end{bmatrix}, \quad S_1 = \begin{bmatrix} 2 & 1 \\ 1 & 5 \end{bmatrix}, \quad \bar{x}_2 = \begin{bmatrix} 1 \\ 8 \end{bmatrix}, \quad S_2 = \begin{bmatrix} 4 & 1 \\ 1 & 3 \end{bmatrix}.$$

For questions (a)-(e), assume that $\Sigma_1 = \Sigma_2 = \Sigma$.

(a) (5 points) Test $H_0: \mu_1 = \mu_2$ using Hotelling's $T^2$ at the significance level $\alpha = 0.05$. (Hint: If you cannot find the exact critical value in the tables, use the closest value instead.)
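Under the equal-covariance assumption, the two-sample Hotelling statistic is $T^2 = (\bar{x}_1 - \bar{x}_2)' \left[ \left( \tfrac{1}{n_1} + \tfrac{1}{n_2} \right) S_{pooled} \right]^{-1} (\bar{x}_1 - \bar{x}_2)$, compared with $\tfrac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1} F_{p,\, n_1 + n_2 - p - 1}(\alpha)$. The R sketch below is a minimal way to check a hand calculation for part (a). It assumes the summary statistics are $\bar{x}_1 = [2, 6]'$, $\bar{x}_2 = [1, 8]'$, $S_1$ with rows 2 1 and 1 5, and $S_2$ with rows 4 1 and 1 3; it uses `qf()` for the exact critical value, whereas the exam expects a table lookup.

```r
# Summary statistics as read from Question 4 (assumed values)
n1 <- 41; n2 <- 61; p <- 2
xbar1 <- c(2, 6)
xbar2 <- c(1, 8)
S1 <- matrix(c(2, 1, 1, 5), 2, 2)
S2 <- matrix(c(4, 1, 1, 3), 2, 2)

# Pooled covariance estimate under Sigma1 = Sigma2
S_pooled <- ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Two-sample Hotelling T^2 statistic
d  <- xbar1 - xbar2
T2 <- drop(t(d) %*% solve((1 / n1 + 1 / n2) * S_pooled) %*% d)

# Critical value: scaled upper 5% point of F with (p, n1 + n2 - p - 1) df
crit <- (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * qf(0.95, p, n1 + n2 - p - 1)

reject_H0 <- T2 > crit
cat("T2 =", T2, " critical value =", crit, " reject H0:", reject_H0, "\n")
```

The objects `S_pooled` and `d` are also the ingredients of the simultaneous intervals and the linear classification rule asked for in the later parts of this question.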

(b) (4 points) Construct the 95% simultaneous $T^2$ confidence intervals for $(\mu_{11} - \mu_{21})$ and $(\mu_{12} - \mu_{22})$. Note that $\mu_1 = [\mu_{11}, \mu_{12}]'$ and $\mu_2 = [\mu_{21}, \mu_{22}]'$.

(c) (6 points) Construct the estimated minimum ECM classification rule.

(d) (3 points) Assume equal prior probabilities, $c(1|2) = 20$ and $c(2|1) = 100$. Allocate the observation $x_0 = [3, 7]'$ to population $\pi_1$ or population $\pi_2$ based on the classification rule in (c).

(e) (5 points) Assume equal prior probabilities. Allocate the observation $x_0 = [3, 7]'$ to population $\pi_1$ or population $\pi_2$ using the estimated Bayes classifier.

(f) (8 points) Test H0 : Σ1 = Σ2 using Box’s test (Bartlett’s test) with the significance level α = 0.05. According to the test result, does the classification rule obtained in (c) need to be adjusted?

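5. Observations on two responses are collected for three treatments. The observation vectors $[x_1, x_2]'$ are

Treatment 1: $\begin{bmatrix} 5 \\ 6 \end{bmatrix}, \begin{bmatrix} 10 \\ 8 \end{bmatrix}, \begin{bmatrix} 5 \\ 4 \end{bmatrix}, \begin{bmatrix} 8 \\ 6 \end{bmatrix}$,

Treatment 2: $\begin{bmatrix} 4 \\ 4 \end{bmatrix}, \begin{bmatrix} 5 \\ 3 \end{bmatrix}, \begin{bmatrix} 3 \\ 2 \end{bmatrix}$,

Treatment 3: $\begin{bmatrix} 2 \\ 1 \end{bmatrix}, \begin{bmatrix} 5 \\ 2 \end{bmatrix}, \begin{bmatrix} 6 \\ 4 \end{bmatrix}, \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \begin{bmatrix} 4 \\ 5 \end{bmatrix}$.

(a) (10 points) Compute Wilks' Lambda $\Lambda^*$ for testing the treatment effects.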

(b) (4 points) Use the test statistic based on Wilks' Lambda to test for treatment effects with $\alpha = 0.05$. (Hint: The test statistic is

$$\left( \frac{N - g - 1}{g - 1} \right) \left( \frac{1 - \sqrt{\Lambda^*}}{\sqrt{\Lambda^*}} \right)$$

with $N = \sum_{l=1}^{g} n_l$ and $g$ being the number of groups; the test statistic follows the $F_{2(g-1),\, 2(N-g-1)}$ distribution under $H_0$.)
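The quantities in Question 5 can be checked numerically: Wilks' Lambda is $\Lambda^* = |W| / |B + W|$, where $W$ is the within-treatment and $B$ the between-treatment sum of squares and cross-products matrix. The R sketch below is a minimal illustration, not the marking scheme. It assumes the observations are Treatment 1: (5,6), (10,8), (5,4), (8,6); Treatment 2: (4,4), (5,3), (3,2); Treatment 3: (2,1), (5,2), (6,4), (3,3), (4,5). It uses `qf()` in place of a table lookup, and the object names are illustrative.

```r
# Observations (rows) on the two responses, stacked by treatment (assumed values)
X <- rbind(c(5, 6), c(10, 8), c(5, 4), c(8, 6),        # Treatment 1
           c(4, 4), c(5, 3), c(3, 2),                  # Treatment 2
           c(2, 1), c(5, 2), c(6, 4), c(3, 3), c(4, 5)) # Treatment 3
trt <- factor(rep(1:3, times = c(4, 3, 5)))

N <- nrow(X)
g <- nlevels(trt)
grand_mean <- colMeans(X)

# Accumulate within (W) and between (B) sums of squares and cross-products
W <- matrix(0, 2, 2)
B <- matrix(0, 2, 2)
for (l in levels(trt)) {
  Xl <- X[trt == l, , drop = FALSE]
  ml <- colMeans(Xl)
  W  <- W + crossprod(sweep(Xl, 2, ml))                # sum of (x - ml)(x - ml)'
  B  <- B + nrow(Xl) * tcrossprod(ml - grand_mean)     # n_l (ml - mean)(ml - mean)'
}

# Wilks' Lambda and the F statistic from the hint (valid for p = 2 responses)
wilks  <- det(W) / det(B + W)
F_stat <- ((N - g - 1) / (g - 1)) * (1 - sqrt(wilks)) / sqrt(wilks)
F_crit <- qf(0.95, 2 * (g - 1), 2 * (N - g - 1))

cat("Lambda* =", wilks, " F =", F_stat, " critical value =", F_crit, "\n")
```

Comparing `F_stat` with `F_crit` gives the test decision for part (b); `W` and `B` can be compared entry by entry with a hand-built MANOVA table.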

6. Consider the data on the national track records for men. The variables are: country, 100m (s), 200m (s), 400m (s), 800m (min), 1500m (min), 5000m (min), 10,000m (min) and Marathon (min). There are 54 countries. Part of the data is as follows

> head(Track, 30)
          Country  100m 200m 400m 800m 1500m 5000m 10,000m Marathon
1       Argentina 10.23 20.4 46.2 1.77  3.68  13.3    27.6      130
2       Australia  9.93 20.1 44.4 1.74  3.53  12.9    27.5      128
3         Austria 10.15 20.4 45.8 1.77  3.58  13.3    27.7      132
4         Belgium 10.14 20.2 45.0 1.73  3.57  12.8    26.9      127
5         Bermuda 10.27 20.3 45.3 1.79  3.70  14.6    30.5      146
6          Brazil 10.00 19.9 44.3 1.70  3.57  13.5    28.1      126
7          Canada  9.84 20.2 44.7 1.75  3.53  13.2    27.6      130
8           Chile 10.10 20.1 45.9 1.76  3.65  13.4    28.1      132
9           China 10.17 20.4 45.2 1.77  3.61  13.4    28.2      129
10       Columbia 10.29 20.9 45.8 1.80  3.72  13.5    27.9      131
11    CookIslands 10.97 22.5 51.4 1.94  4.24  16.7    35.4      171
12      CostaRica 10.32 21.0 46.4 1.87  3.84  13.8    28.8      133
13  CzechRepublic 10.24 20.6 45.8 1.75  3.58  13.4    27.8      132
14        Denmark 10.29 20.5 45.9 1.69  3.52  13.4    27.9      129
15 DominicanRepub 10.16 20.6 44.9 1.81  3.73  14.3    30.4      146
16        Finland 10.21 20.5 45.5 1.74  3.61  13.3    27.5      131
17         France 10.02 20.2 44.6 1.72  3.48  13.0    27.4      126
18        Germany 10.06 20.2 44.3 1.73  3.53  12.9    27.4      128
19   GreatBritain  9.87 19.9 44.4 1.70  3.49  13.0    27.3      127
20         Greece 10.11 19.9 45.6 1.75  3.61  13.5    28.1      132
21      Guatemala 10.32 21.1 48.4 1.82  3.74  14.0    29.3      133
22        Hungary 10.08 20.1 45.4 1.76  3.59  13.4    28.0      132
23          India 10.33 20.7 45.5 1.76  3.63  13.5    28.8      132
24      Indonesia 10.20 20.9 46.4 1.83  3.77  14.2    29.6      139
25        Ireland 10.35 20.5 45.6 1.75  3.56  13.1    27.8      129
26         Israel 10.20 20.9 46.6 1.80  3.70  13.7    28.7      134
27          Italy 10.01 19.7 45.3 1.73  3.55  13.1    27.3      127
28          Japan 10.00 20.0 44.8 1.77  3.62  13.2    27.6      126
29          Kenya 10.28 20.4 44.2 1.70  3.44  12.7    26.5      125
30    Korea,South 10.34 20.4 45.4 1.74  3.64  13.8    28.5      127

Answer the following questions according to the R output given after the questions.

(a) (2 points) How many principal components should be retained if the proportion of total standardized sample variance explained by the principal components should be at least 90% but not larger than 94%?

(b) (4 points) Provide interpretations for the first and second principal components.

(c) (4 points) The factor analysis using the maximum likelihood method is given in the output. The factors are rotated by using the “varimax rotation”. Interpret the two “rotated” factors.

(d) (2 points) Based on the estimated two-factor model, calculate the communality for the variable “Marathon”.

(e) (2 points) For the maximum likelihood method, are the two common factors sufficient based on the likelihood ratio test using Bartlett's correction at the level $\alpha = 0.01$? (Hint: Use the R output directly.)

(f) (3 points) Figure 1 shows the scatter plot for the factor scores. Are there any outliers in the data? Explain the reason.

[Figure 1: Scatter plot for the factor scores. Horizontal axis: Factor 1 (ticks from -1 to 3); vertical axis: Factor 2 (ticks from -2 to 2). Each point is labeled with its country name.]

R output for Question 6:

> Y <- Track[ , -1]
> Track.pca = prcomp(Y, scale=TRUE)
> summary(Track.pca)
Importance of components:
                         PC1    PC2    PC3    PC4    PC5
Standard deviation     2.589 0.7990 0.4770 0.4537 0.3124
Proportion of Variance 0.838 0.0798 0.0284 0.0257 0.0122
Cumulative Proportion  0.838 0.9177 0.9462 0.9719 0.9841
                           PC6     PC7     PC8
Standard deviation     0.26587 0.21666 0.09858
Proportion of Variance 0.00884 0.00587 0.00121
Cumulative Proportion  0.99292 0.99879 1.00000
> Track.pca$rotation
            PC1     PC2      PC3     PC4     PC5     PC6
100m     -0.332 -0.5294 -0.34386 -0.3807  0.2997 -0.3620
200m     -0.346 -0.4704  0.00379 -0.2170 -0.5414  0.3486
400m     -0.339 -0.3453  0.06706  0.8513  0.1330  0.0771
800m     -0.353  0.0895  0.78271 -0.1343 -0.2273 -0.3413
1500m    -0.366  0.1537  0.24427 -0.2330  0.6516  0.5298
5000m    -0.370  0.2948 -0.18286  0.0546  0.0718 -0.3591
10,000m  -0.366  0.3336 -0.24398  0.0871 -0.0613 -0.2731
Marathon -0.354  0.3866 -0.33463 -0.0181 -0.3379  0.3752
            PC7      PC8
100m      0.348 -0.06570
200m     -0.440  0.06076
400m      0.114 -0.00347
800m      0.259 -0.03927
1500m    -0.147 -0.03975
5000m    -0.328  0.70568
10,000m  -0.351 -0.69718
Marathon  0.594  0.06932
> Track.factor <- factanal(Y, 2, cor=T, rotation="varimax",
scores="regression")
> Track.factor
Call:
factanal(x = Y, factors = 2, scores = "regression", rotation = "varimax",
cor = T)

Uniquenesses:
    100m     200m     400m     800m    1500m    5000m  10,000m Marathon
   0.135    0.037    0.228    0.212    0.134    0.012    0.011    0.088

Loadings:
         Factor1 Factor2
100m     0.397   0.841
200m     0.404   0.894
400m     0.511   0.714
800m     0.667   0.585
1500m    0.745   0.558
5000m    0.883   0.455
10,000m  0.897   0.429
Marathon 0.863   0.410

               Factor1 Factor2
SS loadings      3.912   3.231
Proportion Var   0.489   0.404
Cumulative Var   0.489   0.893

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 25.9 on 13 degrees of freedom.
The p-value is 0.0173
> Track.scores <- Track.factor$scores
> Track.df <- data.frame(Y, Track.scores)
> rownames(Track.df) = Track[ ,1]
> attach(Track.df)
> plot(Factor1, Factor2, type='n', xlab='Factor 1', ylab='Factor 2')
> text(Factor1, Factor2, labels = row.names(Track.df), cex = 0.7, col="blue")
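
For Question 6(d), the communality of a variable in an orthogonal factor model is the sum of its squared loadings, and for a fit to the correlation matrix it equals one minus the uniqueness. The R sketch below checks that the two routes agree for “Marathon”. It assumes the rotated loadings for Marathon are 0.863 (Factor 1) and 0.410 (Factor 2) and that its uniqueness is 0.088, as read from the `factanal` output; the object names are illustrative only.

```r
# Rotated loadings and uniqueness for "Marathon" (values assumed from the output)
loadings_marathon   <- c(Factor1 = 0.863, Factor2 = 0.410)
uniqueness_marathon <- 0.088

# Communality: sum of squared loadings across the common factors
communality <- sum(loadings_marathon^2)

# Cross-check: communality + uniqueness should be close to 1 (up to rounding)
total <- communality + uniqueness_marathon

cat("communality:", round(communality, 3), " communality + uniqueness:", round(total, 3), "\n")
```

With a fitted object available, the same check for every variable is `rowSums(loadings(Track.factor)^2) + Track.factor$uniquenesses`, which should return values near 1.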