fitrealdata.Rmd
In this document, we will analyze a real dataset of claim frequency with LRMoE. We will compare how it performs compared with the classical Poisson generalized linear model (GLM).
We will use the ausprivauto0405
dataset available in the R package CASdatasets
.
library(CASdatasets)
data(ausprivauto0405)
The dataset contains some typical covariates used in predicting auto-insurance claim frequency and severity. For simplicity, we will only consider the problem of modelling claim frequency. A brief summary of the dataset is given below.
summary(ausprivauto0405)
## Exposure VehValue VehAge
## Min. :0.002738 Min. : 0.000 old cars :20064
## 1st Qu.:0.219028 1st Qu.: 1.010 oldest cars :18948
## Median :0.446270 Median : 1.500 young cars :16587
## Mean :0.468651 Mean : 1.777 youngest cars:12257
## 3rd Qu.:0.709103 3rd Qu.: 2.150
## Max. :0.999316 Max. :34.560
##
## VehBody Gender DrivAge
## Sedan :22233 Female:38603 old people :10736
## Hatchback :18915 Male :29253 older work. people:16189
## Station wagon:16261 oldest people : 6547
## Utility : 4586 working people :15767
## Truck : 1750 young people :12875
## Hardtop : 1579 youngest people : 5742
## (Other) : 2532
## ClaimOcc ClaimNb ClaimAmount
## Min. :0.00000 Min. :0.00000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 0.0
## Median :0.00000 Median :0.00000 Median : 0.0
## Mean :0.06814 Mean :0.07276 Mean : 137.3
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0
## Max. :1.00000 Max. :4.00000 Max. :55922.1
##
As a benchmark, we will first fit a Poisson GLM (with log
link) to claim frequency. Note that Exposure
represents how long the policy has been in force / observed, which is used as an offset
term in model fitting. The usual model summary is also given below.
glm_fit = glm(ClaimNb ~ Gender + DrivAge +
VehBody + VehAge + VehValue,
offset = Exposure,
family = poisson(link = "log"),
data = ausprivauto0405)
summary(glm_fit)
##
## Call:
## glm(formula = ClaimNb ~ Gender + DrivAge + VehBody + VehAge +
## VehValue, family = poisson(link = "log"), data = ausprivauto0405,
## offset = Exposure)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8211 -0.4128 -0.3587 -0.3153 4.7158
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.413863 0.320949 -7.521 5.44e-14 ***
## GenderMale -0.016251 0.030083 -0.540 0.589057
## DrivAgeolder work. people 0.208764 0.048913 4.268 1.97e-05 ***
## DrivAgeoldest people 0.005946 0.064328 0.092 0.926357
## DrivAgeworking people 0.235899 0.049037 4.811 1.50e-06 ***
## DrivAgeyoung people 0.281227 0.050684 5.549 2.88e-08 ***
## DrivAgeyoungest people 0.450164 0.059049 7.624 2.47e-14 ***
## VehBodyConvertible -1.864772 0.668202 -2.791 0.005259 **
## VehBodyCoupe -0.647945 0.336949 -1.923 0.054483 .
## VehBodyHardtop -0.864757 0.327828 -2.638 0.008344 **
## VehBodyHatchback -1.018979 0.318260 -3.202 0.001366 **
## VehBodyMinibus -1.083799 0.350034 -3.096 0.001960 **
## VehBodyMotorized caravan -0.461968 0.409251 -1.129 0.258976
## VehBodyPanel van -0.850211 0.338824 -2.509 0.012097 *
## VehBodyRoadster -0.674147 0.659735 -1.022 0.306854
## VehBodySedan -0.974533 0.317675 -3.068 0.002157 **
## VehBodyStation wagon -0.973794 0.317997 -3.062 0.002197 **
## VehBodyTruck -1.015281 0.328322 -3.092 0.001986 **
## VehBodyUtility -1.189603 0.322057 -3.694 0.000221 ***
## VehAgeoldest cars -0.051289 0.040946 -1.253 0.210354
## VehAgeyoung cars 0.105528 0.039636 2.662 0.007757 **
## VehAgeyoungest cars -0.004183 0.048134 -0.087 0.930754
## VehValue 0.034926 0.016855 2.072 0.038248 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25716 on 67855 degrees of freedom
## Residual deviance: 25561 on 67833 degrees of freedom
## AIC: 35042
##
## Number of Fisher Scoring iterations: 6
In order to fit an LRMoE, we first need to transform the dataset. Since DriverAge
, VehBody
and VehAge
are factors, we will use model.matrix
to augment a dummy matrix of zeros and ones to represent them.
# 1: Intercept
df = data.frame(Intercept = rep(1, nrow(ausprivauto0405)))
# 2: Gender
df = cbind(df, GenderMale = model.matrix(~Gender, data = ausprivauto0405)[,2])
# 3-7: DrivAge
df = cbind(df, model.matrix(~DrivAge, data = ausprivauto0405)[,2:6])
# 8-19: VehBody
df = cbind(df, model.matrix(~VehBody, data = ausprivauto0405)[,2:13])
# 20-22: VehAge
df = cbind(df, VehAge = model.matrix(~VehAge, data = ausprivauto0405)[,2:4])
# 23: VehValue
df = cbind(df, VehValue = ausprivauto0405$VehValue)
# 24: Exposure
df = cbind(df, Exposure = ausprivauto0405$Exposure)
# 25: ClaimNb
df = cbind(df, ClaimNb = ausprivauto0405$ClaimNb)
df = as.matrix(df)
When fitting an LRMoE model, a good starting point usually reduces run time and can potentially land on a better fit. In the LRMoE package, we provide a function for parameter initialization based on the Clustered Method-of-Moments (CMM).
The following demonstrates how to initialize a 3-component LRMoE model.
The init
object contains a list of summary statistics of Y
, as well as a set of parameter initialization. Let us consider a 3-component Poisson mixture.
# Initialization of alpha
alpha_init = init$alpha_init
alpha_init
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 1.9156411 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0.9463612 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0.0000000 0 0 0 0 0 0 0 0 0 0 0 0
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
# Initialization of component distributions
# The list indices are specified as:
# [[1]]: the 1st dimension of the response of Y
# [[1]]/[[2]]/[[3]]: the 1st, 2nd and 3rd latent component/group
# [[5]]: the 5th ponential choice of expert is zipoisson
comp_dist = matrix(list(init$params_init[[1]][[1]][[1]]$distribution,
init$params_init[[1]][[2]][[1]]$distribution,
init$params_init[[1]][[3]][[1]]$distribution),
nrow = 1, byrow = TRUE)
params_init = matrix(list(init$params_init[[1]][[1]][[1]]$params,
init$params_init[[1]][[2]][[1]]$params,
init$params_init[[1]][[3]][[1]]$params),
nrow = 1, byrow = TRUE)
If interested, we can also check some summary statistics of Y
based on the initialization. For example, the proportion of zeros and the mean of positive Y
’s are given below. These summary statistics may help with selecting the appropriate expert functions for each latent component/group.
init$zero_y
## [,1] [,2] [,3]
## [1,] 0.9303021 0.931147 0.9442322
init$mean_y_pos
## [,1] [,2] [,3]
## [1,] 1.073273 1.052541 1.068493
With an initialization of model, we can now fit the LRMoE with zero-inflated Poisson experts. Note that exposure
is also taken into account. In the case of Poisson distribution with parameter \(\lambda\), a policy with exposure \(E_i\) would have claim frequency distribution \(Poisson(\lambda E_i)\). The same rate parameter \(\lambda\) is shared by all policyholders who have potentially different policy exposures.
(Note: The fitting function takes more than 30 minutes to run. Adjusting parameters such as a larger eps
or smaller ecm_iter_max
will provide a quicker, but potentially worse, model fit.)
LRMoE_fit = LRMoE::FitLRMoE(Y = Y, X = X,
alpha_init = alpha_init,
comp_dist = comp_dist, params_list = params_init,
exposure = exposure,
exact_Y = TRUE,
eps = 0.05, ecm_iter_max = 25)
We can examine the fitted model as follows, as well as its loglikelihood, AIC and BIC. Note that, although we are fitting the same dataset, the result above is different from that presented in the Actuary Magazine. This is most likely due to a different initialization as well as stopping conditions. In practice, such differences are unlikely to have a material impact, as long as the overall goodness-of-fit is similar for these different models.
LRMoE_fit$alpha_fit
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2.0585699 -0.005688619 0.05877480 0.005617090 0.06715980 0.08454027
## [2,] 0.8942683 -0.003519320 0.04131673 0.004095807 0.04718328 0.05913152
## [3,] 0.0000000 0.000000000 0.00000000 0.000000000 0.00000000 0.00000000
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.14109229 -0.3651218 -0.09086990 -0.18098895 -0.22567502 -0.227055362
## [2,] 0.09868725 -0.1000964 0.08362063 0.02455771 -0.00649267 -0.007299331
## [3,] 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000 0.000000000
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] -0.02089958 -0.19311414 -0.1066918 -0.2138805707 -0.210504745 -0.227362371
## [2,] 0.12861213 0.01369662 0.0543942 0.0006437923 0.003129555 -0.009435871
## [3,] 0.00000000 0.00000000 0.0000000 0.0000000000 0.000000000 0.000000000
## [,19] [,20] [,21] [,22] [,23]
## [1,] -0.26573563 -0.01533255 0.03216934 0.01467385 0.008048246
## [2,] -0.03527813 -0.01064473 0.02187543 0.01022420 0.005992822
## [3,] 0.00000000 0.00000000 0.00000000 0.00000000 0.000000000
LRMoE::print_expert_matrix(LRMoE_fit$model_fit)
## [1] "Dimension 1 Component 1 :"
## [1] "poisson : lambda=0.170144405986577"
## [1] "Dimension 1 Component 2 :"
## [1] "poisson : lambda=0.145955830488811"
## [1] "Dimension 1 Component 3 :"
## [1] "poisson : lambda=0.0771593490653163"
LRMoE_fit$ll
## [1] -17473.37
LRMoE_fit$AIC
## [1] 35028.25
LRMoE_fit$BIC
## [1] 35475.38