
03 Advanced Statistics - Principal Component Analysis

PCA
Philosophy of PCA
• Introduced by Pearson (1901) and
Hotelling (1933) to describe the variation
in a set of multivariate data in terms of a
set of uncorrelated variables
• We typically have a data matrix of n
observations on p correlated variables
x1,x2,…xp
• PCA looks for a transformation of the xi
into p new variables yi that are
uncorrelated
Philosophy of PCA
• An orthogonal linear transformation
transforms the data to a new coordinate
system such that the greatest variance
comes to lie on the first coordinate, the
second greatest variance on the second
coordinate and so on.
Philosophy of PCA
• The objectives of principal
components analysis are
- data reduction
- interpretation
• The results of principal components
analysis are often used as inputs to
- regression analysis
- cluster analysis
• Suppose we have a population measured on p
random variables X1,…,Xp. Note that these
random variables represent the p axes of the
Cartesian coordinate system in which the
population resides.
• Our goal is to develop a new set of p axes (linear
combinations of the original p axes) in the
directions of greatest variability:
[Figure: data cloud in the (X1, X2) plane, with new axes rotated toward the directions of greatest variability]
This is accomplished by rotating the axes.
Consider our random vector

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}$$

with covariance matrix $\Sigma$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.
We can construct p linear combinations
$$\begin{aligned}
Y_1 &= a_1'X = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\
Y_2 &= a_2'X = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\
&\ \vdots \\
Y_p &= a_p'X = a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p
\end{aligned}$$
It is easy to show that
$$\mathrm{Var}(Y_i) = a_i'\Sigma a_i, \quad i = 1, \ldots, p$$
$$\mathrm{Cov}(Y_i, Y_k) = a_i'\Sigma a_k, \quad i, k = 1, \ldots, p$$
The principal components are those uncorrelated linear
combinations Y1,…,Yp whose variances are as large as
possible.
Thus the first principal component is the linear
combination of maximum variance, i.e., we wish to solve
the nonlinear optimization problem
$$\max_{a_1}\ a_1'\Sigma a_1 \quad \text{subject to} \quad a_1'a_1 = 1$$

The quadratic objective is the source of the nonlinearity, and the constraint restricts the search to coefficient vectors of unit length.
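The maximizer turns out to be the leading eigenvector of $\Sigma$ (shown below). As an illustration only, a power-iteration sketch finds it numerically; it assumes $\lambda_1$ is strictly largest and the starting vector is not orthogonal to the leading eigenvector, and the helper name is ours, not from the slides:

```python
import numpy as np

def first_pc(S, iters=1000):
    """Approximately solve max a'Sa subject to a'a = 1 by power iteration:
    repeatedly apply S and renormalize back to unit length."""
    a = np.ones(S.shape[0]) / np.sqrt(S.shape[0])  # unit-length starting vector
    for _ in range(iters):
        a = S @ a                   # amplify the leading eigendirection
        a /= np.linalg.norm(a)      # re-impose a'a = 1
    return a
```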
The second principal component is the linear
combination of maximum variance that is uncorrelated
with the first principal component, i.e., we wish to solve
the nonlinear optimization problem
$$\max_{a_2}\ a_2'\Sigma a_2 \quad \text{subject to} \quad a_2'a_2 = 1, \quad a_1'\Sigma a_2 = 0$$

The second constraint restricts the covariance with the first principal component to zero.
The third principal component is the solution to the
nonlinear optimization problem
$$\max_{a_3}\ a_3'\Sigma a_3 \quad \text{subject to} \quad a_3'a_3 = 1, \quad a_1'\Sigma a_3 = 0, \quad a_2'\Sigma a_3 = 0$$

The last two constraints restrict the covariances with the first two principal components to zero.
Generally, the ith principal component is the linear
combination of maximum variance that is uncorrelated
with all previous principal components, i.e., we wish to
solve the nonlinear optimization problem
$$\max_{a_i}\ a_i'\Sigma a_i \quad \text{subject to} \quad a_i'a_i = 1, \quad a_k'\Sigma a_i = 0 \ \text{for all } k < i$$
We can show that, for random vector $X$ with covariance matrix $\Sigma$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, the ith principal component is given by

$$Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \quad i = 1, \ldots, p$$

where $e_i$ is the eigenvector of $\Sigma$ associated with $\lambda_i$. Note that the principal components are not unique if some eigenvalues are equal.
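Numerically, these eigenvalue-eigenvector pairs come straight from an eigen-decomposition of the covariance matrix. Below is a minimal numpy sketch (the function name pca_components is ours, not from the slides); column i of the returned eigvecs is $e_i$:

```python
import numpy as np

def pca_components(X):
    """Eigenvalues and eigenvectors of the population covariance matrix of X,
    sorted so that lambda_1 >= lambda_2 >= ... >= lambda_p.
    Rows of X are observations; columns are the variables X_1, ..., X_p."""
    S = np.cov(X, rowvar=False, bias=True)  # bias=True: divide by n, not n - 1
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]       # reorder, largest eigenvalue first
    return eigvals[order], eigvecs[:, order]
```

The principal component scores are then (X - X.mean(axis=0)) @ eigvecs.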
We can also show, for random vector $X$ with covariance matrix $\Sigma$ and eigenvalue-eigenvector pairs $(\lambda_1, e_1), \ldots, (\lambda_p, e_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$, that

$$\sigma_{11} + \cdots + \sigma_{pp} = \sum_{i=1}^{p} \mathrm{Var}(X_i) = \lambda_1 + \cdots + \lambda_p = \sum_{i=1}^{p} \mathrm{Var}(Y_i)$$
so we can assess how well a subset of the principal components $Y_i$ summarizes the original random variables $X_i$. One common method of doing so is

$$\frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i} = \text{proportion of total population variance due to the } k\text{th principal component}$$
If a large proportion of the total population variance can
be attributed to relatively few principal components, we
can replace the original p variables with these principal
components without loss of much information!
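In code, this proportion is a one-liner on the eigenvalues returned by the pca_components sketch above:

```python
def explained_variance_ratio(eigvals):
    """lambda_k / (lambda_1 + ... + lambda_p) for each component k."""
    return eigvals / eigvals.sum()
```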
We can also easily find the correlations between the
original random variables Xk and the principal
components Yi:
$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}$$
These values are often used in interpreting the principal
components Yi.
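These correlations can be computed in one vectorized step; a sketch consistent with the formula above (the helper name is again ours):

```python
import numpy as np

def pc_variable_correlations(eigvals, eigvecs, S):
    """rho(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(sigma_kk).
    Entry [k, i] of the result is the correlation between X_k and Y_i."""
    sd = np.sqrt(np.diag(S))                        # the sqrt(sigma_kk) terms
    return eigvecs * np.sqrt(eigvals) / sd[:, None]
```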
Example: Suppose we have the following population of
four observations made on three random variables X1, X2,
and X3:
X1     X2     X3
1.0    6.0    9.0
4.0   12.0   10.0
3.0   12.0   15.0
4.0   10.0   12.0
Find the three population principal components Y1, Y2, and Y3:
First we need the covariance matrix $\Sigma$:

$$\Sigma = \begin{bmatrix} 1.50 & 2.50 & 1.00 \\ 2.50 & 6.00 & 3.50 \\ 1.00 & 3.50 & 5.25 \end{bmatrix}$$
and the corresponding eigenvalue-eigenvector pairs:

$$\lambda_1 = 9.9145474, \quad e_1 = \begin{bmatrix} 0.2910381 \\ 0.7342493 \\ 0.6133309 \end{bmatrix}$$

$$\lambda_2 = 2.5344988, \quad e_2 = \begin{bmatrix} 0.4150386 \\ 0.4807165 \\ -0.7724340 \end{bmatrix}$$

$$\lambda_3 = 0.3009542, \quad e_3 = \begin{bmatrix} 0.8619976 \\ -0.4793640 \\ 0.1648350 \end{bmatrix}$$
so the principal components are:
$$\begin{aligned}
Y_1 &= e_1'X = 0.2910381X_1 + 0.7342493X_2 + 0.6133309X_3 \\
Y_2 &= e_2'X = 0.4150386X_1 + 0.4807165X_2 - 0.7724340X_3 \\
Y_3 &= e_3'X = 0.8619976X_1 - 0.4793640X_2 + 0.1648350X_3
\end{aligned}$$
and the proportion of total population variance due to each principal component is (note that $\sum_{i=1}^{3}\lambda_i = \sigma_{11} + \sigma_{22} + \sigma_{33} = 12.75$):

$$\frac{\lambda_1}{\sum_{i=1}^{3}\lambda_i} = \frac{9.9145474}{12.75} = 0.777611529$$

$$\frac{\lambda_2}{\sum_{i=1}^{3}\lambda_i} = \frac{2.5344988}{12.75} = 0.198784220$$

$$\frac{\lambda_3}{\sum_{i=1}^{3}\lambda_i} = \frac{0.3009542}{12.75} = 0.023604251$$
Note that the third principal component is relatively unimportant!
Next we obtain the correlations between the original random variables $X_k$ and the principal components $Y_i$:

$$\begin{aligned}
\rho_{Y_1,X_1} &= \frac{e_{11}\sqrt{\lambda_1}}{\sqrt{\sigma_{11}}} = \frac{0.2910381\sqrt{9.9145474}}{\sqrt{1.50}} = 0.748240 \\
\rho_{Y_1,X_2} &= \frac{e_{12}\sqrt{\lambda_1}}{\sqrt{\sigma_{22}}} = \frac{0.7342493\sqrt{9.9145474}}{\sqrt{6.00}} = 0.943853 \\
\rho_{Y_1,X_3} &= \frac{e_{13}\sqrt{\lambda_1}}{\sqrt{\sigma_{33}}} = \frac{0.6133309\sqrt{9.9145474}}{\sqrt{5.25}} = 0.842852 \\
\rho_{Y_2,X_1} &= \frac{e_{21}\sqrt{\lambda_2}}{\sqrt{\sigma_{11}}} = \frac{0.4150386\sqrt{2.5344988}}{\sqrt{1.50}} = 0.539496 \\
\rho_{Y_2,X_2} &= \frac{e_{22}\sqrt{\lambda_2}}{\sqrt{\sigma_{22}}} = \frac{0.4807165\sqrt{2.5344988}}{\sqrt{6.00}} = 0.312435 \\
\rho_{Y_2,X_3} &= \frac{e_{23}\sqrt{\lambda_2}}{\sqrt{\sigma_{33}}} = \frac{-0.7724340\sqrt{2.5344988}}{\sqrt{5.25}} = -0.536695 \\
\rho_{Y_3,X_1} &= \frac{e_{31}\sqrt{\lambda_3}}{\sqrt{\sigma_{11}}} = \frac{0.8619976\sqrt{0.3009542}}{\sqrt{1.50}} = 0.386110 \\
\rho_{Y_3,X_2} &= \frac{e_{32}\sqrt{\lambda_3}}{\sqrt{\sigma_{22}}} = \frac{-0.4793640\sqrt{0.3009542}}{\sqrt{6.00}} = -0.107359 \\
\rho_{Y_3,X_3} &= \frac{e_{33}\sqrt{\lambda_3}}{\sqrt{\sigma_{33}}} = \frac{0.1648350\sqrt{0.3009542}}{\sqrt{5.25}} = 0.039467
\end{aligned}$$
We can display these results in a correlation matrix:
          Y1          Y2          Y3
X1    0.748240    0.539496    0.386110
X2    0.943853    0.312435   -0.107359
X3    0.842852   -0.536695    0.039467
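The whole example can be checked against the sketches above. Note that a numerical eigen-solver may flip the sign of any eigenvector, which flips the sign of the corresponding column of results:

```python
import numpy as np

X = np.array([[1.0,  6.0,  9.0],
              [4.0, 12.0, 10.0],
              [3.0, 12.0, 15.0],
              [4.0, 10.0, 12.0]])

eigvals, eigvecs = pca_components(X)        # eigvals ~ (9.9145, 2.5345, 0.3010)
print(explained_variance_ratio(eigvals))    # ~ (0.7776, 0.1988, 0.0236)

S = np.cov(X, rowvar=False, bias=True)      # the population covariance above
print(pc_variable_correlations(eigvals, eigvecs, S))
```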
[Figure: first vs. second PCs; the correlations of X1, X2, and X3 plotted against Y1 (horizontal axis) and Y2 (vertical axis)]
The first principal component (Y1) is a mixture of all three random
variables (X1, X2, and X3)
The second principal component (Y2) is a trade-off between X1 and X3
[Figure: first vs. third PCs; the correlations of X1, X2, and X3 plotted against Y1 (horizontal axis) and Y3 (vertical axis)]
The third principal component (Y3) is a residual of X1
Example: twelve car brands rated on eight attributes (ratings run from 1 = best to 6 = worst):

Brand      Fuel Cons.  Service  Value  Price  Design  Sporty  Safety  Easy Handling
Audi           4          3       2      4      3       3       2          3
BMW            5          2       2      5      2       3       2          3
Citroen        3          4       4      3      4       4       4          3
Ferrari        5          3       2      6      2       1       3          4
Fiat           2          4       4      3      5       4       4          2
Jaguar         5          2       2      6      1       2       3          4
Lada           3          4       4      2      4       5       5          3
Mercedes       4          2       2      5      2       3       1          2
Opel           2          2       3      3      3       4       3          2
Rover          4          3       3      4      3       3       3          3
Trabant        4          5       6      2      4       6       6          3
VW             2          2       2      3      3       3       3          2
[Figure: the twelve brands plotted on the first two principal components (PC 1 vs. PC 2), with a "Best Value" annotation]
[Figure: the eight attributes (Fuel, Service, Value, Price, Design, Sport, Safety, Easy Handling) plotted on PC 1 vs. PC 2]
[Figure: combined biplot of brands and attributes on PC 1 vs. PC 2]
• The first principal component explains 76% of the variability of the data set.
• The first two principal components explain more than 90% of the variability of the data set.
• The third PC explains only 5% of the variability (see the sketch below).
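This analysis can be reproduced with the earlier sketches; here we assume PCA on the covariance matrix of the raw ratings from the table above (the slides do not say whether the data were standardized first, so the exact percentages may differ):

```python
import numpy as np

# Ratings from the table above; columns in table order:
# Fuel Cons., Service, Value, Price, Design, Sporty, Safety, Easy Handling
cars = np.array([
    [4, 3, 2, 4, 3, 3, 2, 3],   # Audi
    [5, 2, 2, 5, 2, 3, 2, 3],   # BMW
    [3, 4, 4, 3, 4, 4, 4, 3],   # Citroen
    [5, 3, 2, 6, 2, 1, 3, 4],   # Ferrari
    [2, 4, 4, 3, 5, 4, 4, 2],   # Fiat
    [5, 2, 2, 6, 1, 2, 3, 4],   # Jaguar
    [3, 4, 4, 2, 4, 5, 5, 3],   # Lada
    [4, 2, 2, 5, 2, 3, 1, 2],   # Mercedes
    [2, 2, 3, 3, 3, 4, 3, 2],   # Opel
    [4, 3, 3, 4, 3, 3, 3, 3],   # Rover
    [4, 5, 6, 2, 4, 6, 6, 3],   # Trabant
    [2, 2, 2, 3, 3, 3, 3, 2],   # VW
], dtype=float)

eigvals, eigvecs = pca_components(cars)            # sketch defined earlier
print(explained_variance_ratio(eigvals).cumsum())  # cumulative variance shares
scores = (cars - cars.mean(axis=0)) @ eigvecs      # PC scores for the biplots
```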
How many PC’s?
• Data from Bryce and Barker on a preliminary study of a possible link between football helmet design and neck injuries. Six head measurements were made on each subject.
[Figure: correlations of the six head measurements (Wdim, Circum, FBEye, EyeHD, EarHD, Jaw) with the first two principal components]
Method 1
• Retain the smallest number of components that account for a specified percentage of the total variance (see the sketch after this list).
• The challenge lies in selecting an appropriate threshold percentage.
• If we aim too high, we run the risk of including components that are either sample specific or variable specific.
• Sample specific: a component may not generalize to the population or to other samples.
• Variable specific: a component is dominated by a single variable and does not represent a composite summary of several variables.
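A minimal sketch of Method 1, assuming eigvals is sorted in decreasing order as returned by the earlier pca_components helper:

```python
import numpy as np

def n_components_method1(eigvals, threshold=0.90):
    """Smallest number of leading components whose cumulative share of
    total variance reaches the threshold (e.g. 90%)."""
    share = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(share, threshold) + 1)
```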
Method 2
• Widely used, and the default in many software packages.
• Retains those components that account for more variance than the average variance of the data; when PCA is run on a correlation matrix the average eigenvalue is 1, so this becomes the familiar "eigenvalue greater than one" rule (see the sketch below).
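A sketch of Method 2 on the eigenvalues from earlier:

```python
import numpy as np

def n_components_method2(eigvals):
    """Keep components whose eigenvalue exceeds the average eigenvalue
    (equivalently, exceeds 1.0 when PCA is run on a correlation matrix)."""
    return int((eigvals > eigvals.mean()).sum())
```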
Method 3
• Plot eigenvalues vs. number of components (see the plotting sketch below).
• Extract components up to the point where the plot "levels off".
• The right-hand tail of the curve is the "scree" (like the lower part of a rocky slope).

[Figure: scree plot of eigenvalue vs. component number]
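A minimal matplotlib sketch of the scree plot, again assuming eigvals is sorted in decreasing order:

```python
import numpy as np
import matplotlib.pyplot as plt

def scree_plot(eigvals):
    """Method 3: plot eigenvalues in decreasing order and look for the elbow."""
    ks = np.arange(1, len(eigvals) + 1)
    plt.plot(ks, eigvals, marker="o")
    plt.xlabel("Component #")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()
```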
Method 4
Thank you