YOGYUI

R caret::varImp - Measuring Variable Importance


요겨 2021. 6. 18. 13:10

 

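When you build a random forest model, one of the ways to understand how the model behaves is 'variable importance'.

A commonly used importance measure, and the one used in this post, is Mean Decrease Gini. While each tree is grown, every split on a variable (attribute) reduces the Gini impurity of the node being split. These reductions are accumulated per variable and averaged over all trees, and the larger the decrease, the more important the variable was during training.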
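To make the idea of an impurity decrease concrete, here is a minimal base R sketch for a single split. The helper names `gini` and `gini_decrease` and the toy data are hypothetical and are not part of randomForest; the package performs this kind of bookkeeping internally for every split of every tree.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease of one split: parent impurity minus the
# size-weighted impurities of the two child nodes
gini_decrease <- function(y, go_left) {
  n <- length(y)
  left <- y[go_left]
  right <- y[!go_left]
  gini(y) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Toy data (made up for illustration): class labels and one numeric predictor
y <- c("neg", "neg", "neg", "pos", "pos", "pos")
x <- c(80, 95, 100, 130, 150, 90)

parent_impurity <- gini(y)                 # balanced two-class node
split_decrease <- gini_decrease(y, x < 110)
print(parent_impurity)
print(split_decrease)
```

A predictor whose splits keep producing large decreases like this, across many trees, ends up with a high Mean Decrease Gini.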

 

For more details on Gini impurity, see the following post:

https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=sungmk86&logNo=221204932461

 


 

The randomForest package provides several functions for inspecting variable importance; the following two are used most often:

library(randomForest)
randomForest::varImpPlot
randomForest::importance

varImpPlot visualizes the importance, while importance returns the actual Mean Decrease Gini values summarized per variable.
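As a side note, Mean Decrease Gini is not the only measure that importance can report. The following is a minimal sketch, assuming the randomForest package is installed and using the built-in iris data instead of this post's dataset: when the model is fit with `importance = TRUE`, the `type` argument selects the permutation-based mean decrease in accuracy (`type = 1`) or the mean decrease in node impurity (`type = 2`).

```r
library(randomForest)

set.seed(1)
# importance = TRUE makes the forest also compute permutation importance
rf_iris <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp_accuracy <- importance(rf_iris, type = 1)  # mean decrease in accuracy
imp_gini     <- importance(rf_iris, type = 2)  # mean decrease in node impurity (Gini)

print(imp_accuracy)
print(imp_gini)
```

The two measures do not have to rank the variables in the same order, so it is worth stating which one is being reported.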

 

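Let's check this with an example.

The data is the Pima Indians diabetes dataset (PimaIndiansDiabetes) (metadata: link)

>> Target attribute for classification: diabetes, a factor ('neg', 'pos')

(Pima: a Native American people of southern Arizona, USA)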

library(mlbench)
data(PimaIndiansDiabetes)

# hold out
set.seed(210618)
library(caret)
idx <- caret::createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.7)
df_train <- PimaIndiansDiabetes[idx$Resample1, ]
df_test <- PimaIndiansDiabetes[-idx$Resample1, ]

library(randomForest)
model_rf <- randomForest(diabetes ~ ., df_train)
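The `idx$Resample1` indexing above is needed because caret::createDataPartition returns a list by default. A small sketch of the alternative, assuming the caret package is installed and using the built-in iris data so that it runs on its own: with `list = FALSE` the function returns a one-column matrix of row indices that can be used directly.

```r
library(caret)

set.seed(210618)
# list = FALSE returns a one-column matrix of row indices instead of a list
idx_mat <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

iris_train <- iris[idx_mat, ]
iris_test  <- iris[-idx_mat, ]

# The partition is stratified on the outcome, so class proportions are preserved
print(table(iris_train$Species))
print(table(iris_test$Species))
```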

 

randomForest::importance(model_rf)
> 
         MeanDecreaseGini
pregnant         20.66890
glucose          62.19365
pressure         22.63882
triceps          16.75059
insulin          17.84821
mass             41.02808
pedigree         33.18713
age              28.19659

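The importance function of the randomForest package gives the mean decrease in Gini impurity for each variable.

The variable with the highest importance for the classification is glucose.

(The exact attribute description is "2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test", i.e. the result of an oral glucose tolerance test. I don't really know the details, since I'm not a doctor...)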

randomForest::varImpPlot(model_rf)

varImpPlot visualizes the Gini impurity decrease per variable so that it can be read at a glance.
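If the plot is not at hand, a similar view can be sketched in base R from the numbers printed above. The vector below simply retypes the MeanDecreaseGini output shown earlier; the variable names `imp` and `imp_sorted` are made up for this example.

```r
# MeanDecreaseGini values copied from the importance() output above
imp <- c(pregnant = 20.66890, glucose = 62.19365, pressure = 22.63882,
         triceps = 16.75059, insulin = 17.84821, mass = 41.02808,
         pedigree = 33.18713, age = 28.19659)

# Rank the variables from most to least important
imp_sorted <- sort(imp, decreasing = TRUE)
print(imp_sorted)

# Dot chart in ascending order, so the most important variable is drawn at the top
dotchart(sort(imp), xlab = "MeanDecreaseGini")
```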


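The caret package has a function that does the same job (varImp).

Its distinguishing feature is that it can be applied to a much wider range of models than the randomForest package's functions.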

 

The help page (?varImp) describes the function as follows:

Calculation of variable importance for regression and classification models

Description

A generic method for calculating variable importance for objects produced by train and method specific methods

Usage

varImp(object, ...)

## S3 method for class 'bagEarth'
varImp(object, ...)

## S3 method for class 'bagFDA'
varImp(object, ...)

## S3 method for class 'C5.0'
varImp(object, ...)

## S3 method for class 'cubist'
varImp(object, weights = c(0.5, 0.5), ...)

## S3 method for class 'dsa'
varImp(object, cuts = NULL, ...)

## S3 method for class 'glm'
varImp(object, ...)

## S3 method for class 'glmnet'
varImp(object, lambda = NULL, ...)

## S3 method for class 'JRip'
varImp(object, ...)

## S3 method for class 'multinom'
varImp(object, ...)

## S3 method for class 'nnet'
varImp(object, ...)

## S3 method for class 'avNNet'
varImp(object, ...)

## S3 method for class 'PART'
varImp(object, ...)

## S3 method for class 'RRF'
varImp(object, ...)

## S3 method for class 'rpart'
varImp(object, surrogates = FALSE, competes = TRUE, ...)

## S3 method for class 'randomForest'
varImp(object, ...)

## S3 method for class 'gbm'
varImp(object, numTrees = NULL, ...)

## S3 method for class 'classbagg'
varImp(object, ...)

## S3 method for class 'regbagg'
varImp(object, ...)

## S3 method for class 'pamrtrained'
varImp(object, threshold, data, ...)

## S3 method for class 'lm'
varImp(object, ...)

## S3 method for class 'mvr'
varImp(object, estimate = NULL, ...)

## S3 method for class 'earth'
varImp(object, value = "gcv", ...)

## S3 method for class 'RandomForest'
varImp(object, ...)

## S3 method for class 'plsda'
varImp(object, ...)

## S3 method for class 'fda'
varImp(object, value = "gcv", ...)

## S3 method for class 'gam'
varImp(object, ...)

## S3 method for class 'Gam'
varImp(object, ...)

## S3 method for class 'train'
varImp(object, useModel = TRUE, nonpara = TRUE, scale = TRUE, ...)

Arguments

object: an object corresponding to a fitted model
...: parameters to pass to the specific varImp methods
weights: a numeric vector of length two that weighs the usage of variables in the rule conditions and the usage in the linear models (see details below).
cuts: the number of rule sets to use in the model (for partDSA only)
lambda: a single value of the penalty parameter
surrogates: should surrogate splits contribute to the importance calculation?
competes: should competing splits contribute to the importance calculation?
numTrees: the number of iterations (trees) to use in a boosted tree model
threshold: the shrinkage threshold (pamr models only)
data: the training set predictors (pamr models only)
estimate: which estimate of performance should be used? See mvrVal
value: the statistic that will be used to calculate importance: either gcv, nsubsets, or rss
useModel: use a model based technique for measuring variable importance? This is only used for some models (lm, pls, rf, rpart, gbm, pam and mars)
nonpara: should nonparametric methods be used to assess the relationship between the features and response (only used with useModel = FALSE and only passed to filterVarImp).
scale: should the importance values be scaled to 0 and 100?

Details

For models that do not have corresponding varImp methods, see filterVarImp.

Otherwise:

Linear Models: the absolute value of the t-statistic for each model parameter is used.

glmboost and glmnet: the absolute value of the coefficients corresponding to the tuned model are used.

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

Partial Least Squares: the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients are weighted proportionally to the reduction in the sums of squares.

Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control. This method does not currently provide class-specific measures of importance when the response is a factor.

Bagged Trees: The same methodology as a single tree is applied to all bootstrapped trees and the total importance is returned

Boosted Trees: varImp.gbm is a wrapper around the function from that package (see the gbm package vignette)

Multivariate Adaptive Regression Splines: MARS models include a backwards elimination feature selection routine that looks at reductions in the generalized cross-validation (GCV) estimate of error. The varImp function tracks the changes in model statistics, such as the GCV, for each predictor and accumulates the reduction in the statistic when each predictor's feature is added to the model. This total reduction is used as the variable importance measure. If a predictor was never used in any of the MARS basis functions in the final model (after pruning), it has an importance value of zero. Prior to June 2008, the package used an internal function for these calculations. Currently, the varImp is a wrapper to the evimp function in the earth package. There are three statistics that can be used to estimate variable importance in MARS models. Using varImp(object, value = "gcv") tracks the reduction in the generalized cross-validation statistic as terms are added. However, there are some cases when terms are retained in the model that result in an increase in GCV. Negative variable importance values for MARS are set to zero. Alternatively, using varImp(object, value = "rss") monitors the change in the residual sums of squares (RSS) as terms are added, which will never be negative. Also, the option varImp(object, value = "nsubsets") counts the number of subsets where the variable is used (in the final, pruned model).

Nearest shrunken centroids: The difference between the class centroids and the overall centroid is used to measure the variable influence (see pamr.predict). The larger the difference between the class centroid and the overall center of the data, the larger the separation between the classes. The training set predictions must be supplied when an object of class pamrtrained is given to varImp.

Cubist: The Cubist output contains variable usage statistics. It gives the percentage of times where each variable was used in a condition and/or a linear model. Note that this output will probably be inconsistent with the rules shown in the output from summary.cubist. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child model along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The percentages shown in the Cubist output reflects all the models involved in prediction (as opposed to the terminal models shown in the output). The variable importance used here is a linear combination of the usage in the rule conditions and the model.

PART and JRip: For these rule-based models, the importance for a predictor is simply the number of rules that involve the predictor.

C5.0: C5.0 measures predictor importance by determining the percentage of training set samples that fall into all the terminal nodes after the split. For example, the predictor in the first split automatically has an importance measurement of 100 percent since all samples are affected by this split. Other predictors may be used frequently in splits, but if the terminal nodes cover only a handful of training set samples, the importance scores may be close to zero. The same strategy is applied to rule-based models and boosted versions of the model. The underlying function can also return the number of times each predictor was involved in a split by using the option metric = "usage".

Neural Networks: The method used here is based on Gevrey et al (2003), which uses combinations of the absolute values of the weights. For classification models, the class-specific importances will be the same.

Recursive Feature Elimination: Variable importance is computed using the ranking method used for feature selection. For the final subset size, the importances for the models across all resamples are averaged to compute an overall value.

Feature Selection via Univariate Filters, the percentage of resamples that a predictor was selected is determined. In other words, an importance of 0.50 means that the predictor survived the filter in half of the resamples.

Value

A data frame with class c("varImp.train", "data.frame") for varImp.train or a matrix for other models.

Author(s)

Max Kuhn

References

Gevrey, M., Dimopoulos, I., & Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3), 249-264.

Quinlan, J. (1992). Learning with continuous classes. Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, 343-348.

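Besides randomForest models, it measures variable importance for a wide range of models such as lm, glm, nnet, PART and gbm.

The measure differs from model to model, of course. For linear models, for example, it is the absolute value of the t-statistic of each fitted coefficient (see the description above for details).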
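The way one function name can cover so many model types is S3 dispatch: the generic picks a method based on the class of the fitted object. The toy generic below is a rough base R sketch of that mechanism and of the linear-model rule; `my_importance` is a hypothetical name and this is not caret's actual implementation. It uses the built-in mtcars data.

```r
# A toy S3 generic: the method is chosen from class(object)
my_importance <- function(object, ...) UseMethod("my_importance")

# Method for linear models: absolute test statistic of each coefficient
my_importance.lm <- function(object, ...) {
  coefs <- summary(object)$coefficients
  keep <- rownames(coefs) != "(Intercept)"
  # column 3 holds the "t value" (lm) or "z value" (binomial glm)
  data.frame(Overall = abs(coefs[keep, 3]))
}

fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

imp_lm  <- my_importance(fit_lm)
# glm objects also inherit from "lm", so the same method is dispatched here
imp_glm <- my_importance(fit_glm)

print(imp_lm)
print(imp_glm)
```

caret ships separate methods per class (as the Usage section above lists), which is what lets each model type use its own notion of importance.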

 

Let's try it right away.

caret::varImp(model_rf)
>
          Overall
pregnant 20.66890
glucose  62.19365
pressure 22.63882
triceps  16.75059
insulin  17.84821
mass     41.02808
pedigree 33.18713
age      28.19659

The result is identical to that of randomForest::importance.
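Rather than comparing the two printouts by eye, the match can be checked programmatically. This is a sketch under the assumption that both randomForest and caret are installed; it uses the built-in iris data so that it is self-contained, and a classification forest fit without `importance = TRUE`, as in this post.

```r
library(randomForest)
library(caret)

set.seed(1)
rf_check <- randomForest(Species ~ ., data = iris)

imp_rf    <- randomForest::importance(rf_check)  # matrix with a MeanDecreaseGini column
imp_caret <- caret::varImp(rf_check)             # data frame with an Overall column

# Compare the values variable by variable
same_values <- all.equal(as.numeric(imp_rf[, "MeanDecreaseGini"]),
                         imp_caret[rownames(imp_rf), "Overall"])
print(same_values)
```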

 

Let's build some other models and measure their variable importance as well.

model_glm <- glm(diabetes ~ ., df_train, family = "binomial")
summary(model_glm)
>
Call:
glm(formula = diabetes ~ ., family = "binomial", data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7123  -0.7151  -0.3962   0.6976   2.8447  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.871915   0.893641  -9.928  < 2e-16 ***
pregnant     0.163505   0.041301   3.959 7.53e-05 ***
glucose      0.038042   0.004559   8.345  < 2e-16 ***
pressure    -0.014477   0.006364  -2.275  0.02291 *  
triceps     -0.002330   0.008531  -0.273  0.78480    
insulin     -0.002001   0.001184  -1.690  0.09103 .  
mass         0.108039   0.019265   5.608 2.05e-08 ***
pedigree     1.108171   0.374094   2.962  0.00305 ** 
age         -0.001875   0.012161  -0.154  0.87744    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 696.28  on 537  degrees of freedom
Residual deviance: 489.20  on 529  degrees of freedom
AIC: 507.2

Number of Fisher Scoring iterations: 5

caret::varImp(model_glm)
>
           Overall
pregnant 3.9588724
glucose  8.3448128
pressure 2.2749557
triceps  0.2730720
insulin  1.6899848
mass     5.6080088
pedigree 2.9622817
age      0.1542135

The importance reported here is the absolute value of the 'z value' column in the coefficient table of summary.

(The z value, or z score, is each coefficient estimate divided by its standard error.)
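This relationship can be checked by hand with the rounded estimates and standard errors printed in the summary above. The vectors below just retype those printed numbers, so the results agree with the varImp output only up to rounding.

```r
# Estimates and standard errors copied from the summary(model_glm) table above
est <- c(pregnant = 0.163505, glucose = 0.038042, pressure = -0.014477,
         triceps = -0.002330, insulin = -0.002001, mass = 0.108039,
         pedigree = 1.108171, age = -0.001875)
se  <- c(pregnant = 0.041301, glucose = 0.004559, pressure = 0.006364,
         triceps = 0.008531, insulin = 0.001184, mass = 0.019265,
         pedigree = 0.374094, age = 0.012161)

# |estimate / standard error| reproduces the Overall column of varImp
z_abs <- abs(est / se)
print(round(z_abs, 3))
```

Note that the sign is dropped: pressure has a negative coefficient but still ranks above insulin, triceps and age.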

 

The same applies to other models such as rpart (decision trees).

library(rpart)
model_tree <- rpart(diabetes ~ ., df_train)
caret::varImp(model_tree)
>
           Overall
age      51.893726
glucose  65.530681
insulin  15.480318
mass     66.204881
pedigree 28.854604
pregnant 38.896275
pressure 18.566101
triceps   7.767947

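For rpart, the importance is based on the reduction in the loss function attributed to each variable at each split. For a classification tree like this one, rpart's default splitting criterion is Gini impurity (mean squared error applies to regression trees). Because competing splits are also tabulated, a variable can receive a nonzero importance even if it does not appear in the plotted tree.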
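For comparison, an rpart fit also stores its own importance vector in the `variable.importance` component. The sketch below assumes the rpart package is installed and uses the built-in iris data. The numbers it prints come from rpart's own calculation and need not match what caret::varImp reports for the same tree.

```r
library(rpart)

# Default classification tree; the splitting index defaults to Gini
tree_default <- rpart(Species ~ ., data = iris)

# The same tree with the splitting index spelled out explicitly
tree_gini <- rpart(Species ~ ., data = iris, parms = list(split = "gini"))

# rpart's own importance measure, stored on the fitted object
print(tree_default$variable.importance)
print(tree_gini$variable.importance)
```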

opar <- par(mfrow = c(1, 1), xpd = NA)
plot(model_tree)
text(model_tree, use.n = TRUE)
par(opar)

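When selecting variables to improve a model's performance, variable importance is a useful guideline. Since caret's varImp function works not only with random forests but with many other models, it should come in handy in a lot of situations.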
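As a rough illustration of using importance for variable selection, the sketch below keeps the top-ranked variables and builds a model formula from them. The importance values are the random forest numbers printed earlier in this post, and the cutoff `k = 4` is an arbitrary assumption made for the example, not a recommendation.

```r
# MeanDecreaseGini values copied from the random forest output earlier in the post
imp <- c(pregnant = 20.66890, glucose = 62.19365, pressure = 22.63882,
         triceps = 16.75059, insulin = 17.84821, mass = 41.02808,
         pedigree = 33.18713, age = 28.19659)

k <- 4  # arbitrary number of variables to keep (assumption for illustration)
top_vars <- names(sort(imp, decreasing = TRUE))[seq_len(k)]

# Build a formula that uses only the selected predictors
reduced_formula <- reformulate(top_vars, response = "diabetes")
print(reduced_formula)
```

The resulting formula can be passed to randomForest, glm or rpart in place of `diabetes ~ .`, and the reduced model should then be compared against the full one on the held-out test set.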
