Analyzing Real Data with LRMoE • LRMoE

Introduction

In this document, we will analyze a real dataset of claim frequency with LRMoE. We will compare how it performs compared with the classical Poisson generalized linear model (GLM).

Overview of Dataset

We will use the ausprivauto0405 dataset available in the R package CASdatasets.

library(CASdatasets)
data(ausprivauto0405)

The dataset contains some typical covariates used in predicting auto-insurance claim frequency and severity. For simplicity, we will only consider the problem of modelling claim frequency. A brief summary of the dataset is given below.

summary(ausprivauto0405)

##     Exposure           VehValue                VehAge     
##  Min.   :0.002738   Min.   : 0.000   old cars     :20064  
##  1st Qu.:0.219028   1st Qu.: 1.010   oldest cars  :18948  
##  Median :0.446270   Median : 1.500   young cars   :16587  
##  Mean   :0.468651   Mean   : 1.777   youngest cars:12257  
##  3rd Qu.:0.709103   3rd Qu.: 2.150                        
##  Max.   :0.999316   Max.   :34.560                        
##                                                           
##           VehBody         Gender                    DrivAge     
##  Sedan        :22233   Female:38603   old people        :10736  
##  Hatchback    :18915   Male  :29253   older work. people:16189  
##  Station wagon:16261                  oldest people     : 6547  
##  Utility      : 4586                  working people    :15767  
##  Truck        : 1750                  young people      :12875  
##  Hardtop      : 1579                  youngest people   : 5742  
##  (Other)      : 2532                                            
##     ClaimOcc          ClaimNb         ClaimAmount     
##  Min.   :0.00000   Min.   :0.00000   Min.   :    0.0  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:    0.0  
##  Median :0.00000   Median :0.00000   Median :    0.0  
##  Mean   :0.06814   Mean   :0.07276   Mean   :  137.3  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:    0.0  
##  Max.   :1.00000   Max.   :4.00000   Max.   :55922.1  
##

Poisson GLM

As a benchmark, we will first fit a Poisson GLM (with log link) to claim frequency. Note that Exposure represents how long the policy has been in force / observed, which is used as an offset term in model fitting. The usual model summary is also given below.

glm_fit = glm(ClaimNb ~ Gender + DrivAge +
            VehBody + VehAge + VehValue,
            offset = Exposure,
            family = poisson(link = "log"),
            data = ausprivauto0405)

summary(glm_fit)

## 
## Call:
## glm(formula = ClaimNb ~ Gender + DrivAge + VehBody + VehAge + 
##     VehValue, family = poisson(link = "log"), data = ausprivauto0405, 
##     offset = Exposure)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8211  -0.4128  -0.3587  -0.3153   4.7158  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.413863   0.320949  -7.521 5.44e-14 ***
## GenderMale                -0.016251   0.030083  -0.540 0.589057    
## DrivAgeolder work. people  0.208764   0.048913   4.268 1.97e-05 ***
## DrivAgeoldest people       0.005946   0.064328   0.092 0.926357    
## DrivAgeworking people      0.235899   0.049037   4.811 1.50e-06 ***
## DrivAgeyoung people        0.281227   0.050684   5.549 2.88e-08 ***
## DrivAgeyoungest people     0.450164   0.059049   7.624 2.47e-14 ***
## VehBodyConvertible        -1.864772   0.668202  -2.791 0.005259 ** 
## VehBodyCoupe              -0.647945   0.336949  -1.923 0.054483 .  
## VehBodyHardtop            -0.864757   0.327828  -2.638 0.008344 ** 
## VehBodyHatchback          -1.018979   0.318260  -3.202 0.001366 ** 
## VehBodyMinibus            -1.083799   0.350034  -3.096 0.001960 ** 
## VehBodyMotorized caravan  -0.461968   0.409251  -1.129 0.258976    
## VehBodyPanel van          -0.850211   0.338824  -2.509 0.012097 *  
## VehBodyRoadster           -0.674147   0.659735  -1.022 0.306854    
## VehBodySedan              -0.974533   0.317675  -3.068 0.002157 ** 
## VehBodyStation wagon      -0.973794   0.317997  -3.062 0.002197 ** 
## VehBodyTruck              -1.015281   0.328322  -3.092 0.001986 ** 
## VehBodyUtility            -1.189603   0.322057  -3.694 0.000221 ***
## VehAgeoldest cars         -0.051289   0.040946  -1.253 0.210354    
## VehAgeyoung cars           0.105528   0.039636   2.662 0.007757 ** 
## VehAgeyoungest cars       -0.004183   0.048134  -0.087 0.930754    
## VehValue                   0.034926   0.016855   2.072 0.038248 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 25716  on 67855  degrees of freedom
## Residual deviance: 25561  on 67833  degrees of freedom
## AIC: 35042
## 
## Number of Fisher Scoring iterations: 6

Data Transformation

In order to fit an LRMoE, we first need to transform the dataset. Since DriverAge, VehBody and VehAge are factors, we will use model.matrix to augment a dummy matrix of zeros and ones to represent them.

# 1: Intercept
df = data.frame(Intercept = rep(1, nrow(ausprivauto0405)))
# 2: Gender
df = cbind(df, GenderMale = model.matrix(~Gender, data = ausprivauto0405)[,2])
# 3-7: DrivAge
df = cbind(df, model.matrix(~DrivAge, data = ausprivauto0405)[,2:6])
# 8-19: VehBody
df = cbind(df, model.matrix(~VehBody, data = ausprivauto0405)[,2:13])
# 20-22: VehAge
df = cbind(df, VehAge = model.matrix(~VehAge, data = ausprivauto0405)[,2:4])
# 23: VehValue
df =  cbind(df, VehValue = ausprivauto0405$VehValue)


# 24: Exposure
df = cbind(df, Exposure = ausprivauto0405$Exposure)


# 25: ClaimNb
df = cbind(df, ClaimNb = ausprivauto0405$ClaimNb)

df = as.matrix(df)

Model Initialization

When fitting an LRMoE model, a good starting point usually reduces run time and can potentially land on a better fit. In the LRMoE package, we provide a function for parameter initialization based on the Clustered Method-of-Moments (CMM).

The following demonstrates how to initialize a 3-component LRMoE model.

set.seed(2021)
X = as.matrix(df[,1:23])
exposure = as.matrix(df[,24])
Y = as.matrix(df[,25])

init = LRMoE::cmm_init(Y = Y, X = X, n_comp = 3, 
                       type = c("discrete"), exact_Y = TRUE)

The init object contains a list of summary statistics of Y, as well as a set of parameter initialization. Let us consider a 3-component Poisson mixture.

# Initialization of alpha
alpha_init = init$alpha_init
alpha_init

##           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 1.9156411    0    0    0    0    0    0    0    0     0     0     0     0
## [2,] 0.9463612    0    0    0    0    0    0    0    0     0     0     0     0
## [3,] 0.0000000    0    0    0    0    0    0    0    0     0     0     0     0
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,]     0     0     0     0     0     0     0     0     0     0
## [2,]     0     0     0     0     0     0     0     0     0     0
## [3,]     0     0     0     0     0     0     0     0     0     0

# Initialization of component distributions
# The list indices are specified as:
# [[1]]: the 1st dimension of the response of Y
# [[1]]/[[2]]/[[3]]: the 1st, 2nd and 3rd latent component/group
# [[5]]: the 5th ponential choice of expert is zipoisson
comp_dist = matrix(list(init$params_init[[1]][[1]][[1]]$distribution,
                        init$params_init[[1]][[2]][[1]]$distribution,
                        init$params_init[[1]][[3]][[1]]$distribution),
                   nrow = 1, byrow = TRUE)
params_init = matrix(list(init$params_init[[1]][[1]][[1]]$params,
                          init$params_init[[1]][[2]][[1]]$params,
                          init$params_init[[1]][[3]][[1]]$params),
                     nrow = 1, byrow = TRUE)

If interested, we can also check some summary statistics of Y based on the initialization. For example, the proportion of zeros and the mean of positive Y’s are given below. These summary statistics may help with selecting the appropriate expert functions for each latent component/group.

init$zero_y

##           [,1]     [,2]      [,3]
## [1,] 0.9303021 0.931147 0.9442322

init$mean_y_pos

##          [,1]     [,2]     [,3]
## [1,] 1.073273 1.052541 1.068493

Poisson LRMoE

With an initialization of model, we can now fit the LRMoE with zero-inflated Poisson experts. Note that exposure is also taken into account. In the case of Poisson distribution with parameter \(\lambda\), a policy with exposure \(E_i\) would have claim frequency distribution \(Poisson(\lambda E_i)\). The same rate parameter \(\lambda\) is shared by all policyholders who have potentially different policy exposures.

(Note: The fitting function takes more than 30 minutes to run. Adjusting parameters such as a larger eps or smaller ecm_iter_max will provide a quicker, but potentially worse, model fit.)

LRMoE_fit = LRMoE::FitLRMoE(Y = Y, X = X,
                            alpha_init = alpha_init,
                            comp_dist = comp_dist, params_list = params_init,
                            exposure = exposure,
                            exact_Y = TRUE,
                            eps = 0.05, ecm_iter_max = 25)

We can examine the fitted model as follows, as well as its loglikelihood, AIC and BIC. Note that, although we are fitting the same dataset, the result above is different from that presented in the Actuary Magazine. This is most likely due to a different initialization as well as stopping conditions. In practice, such differences are unlikely to have a material impact, as long as the overall goodness-of-fit is similar for these different models.

LRMoE_fit$alpha_fit

##           [,1]         [,2]       [,3]        [,4]       [,5]       [,6]
## [1,] 2.0585699 -0.005688619 0.05877480 0.005617090 0.06715980 0.08454027
## [2,] 0.8942683 -0.003519320 0.04131673 0.004095807 0.04718328 0.05913152
## [3,] 0.0000000  0.000000000 0.00000000 0.000000000 0.00000000 0.00000000
##            [,7]       [,8]        [,9]       [,10]       [,11]        [,12]
## [1,] 0.14109229 -0.3651218 -0.09086990 -0.18098895 -0.22567502 -0.227055362
## [2,] 0.09868725 -0.1000964  0.08362063  0.02455771 -0.00649267 -0.007299331
## [3,] 0.00000000  0.0000000  0.00000000  0.00000000  0.00000000  0.000000000
##            [,13]       [,14]      [,15]         [,16]        [,17]        [,18]
## [1,] -0.02089958 -0.19311414 -0.1066918 -0.2138805707 -0.210504745 -0.227362371
## [2,]  0.12861213  0.01369662  0.0543942  0.0006437923  0.003129555 -0.009435871
## [3,]  0.00000000  0.00000000  0.0000000  0.0000000000  0.000000000  0.000000000
##            [,19]       [,20]      [,21]      [,22]       [,23]
## [1,] -0.26573563 -0.01533255 0.03216934 0.01467385 0.008048246
## [2,] -0.03527813 -0.01064473 0.02187543 0.01022420 0.005992822
## [3,]  0.00000000  0.00000000 0.00000000 0.00000000 0.000000000

LRMoE::print_expert_matrix(LRMoE_fit$model_fit)

## [1] "Dimension  1  Component  1 :"
## [1] "poisson :  lambda=0.170144405986577"
## [1] "Dimension  1  Component  2 :"
## [1] "poisson :  lambda=0.145955830488811"
## [1] "Dimension  1  Component  3 :"
## [1] "poisson :  lambda=0.0771593490653163"

LRMoE_fit$ll

## [1] -17473.37

LRMoE_fit$AIC

## [1] 35028.25

LRMoE_fit$BIC

## [1] 35475.38