Data Formatting
There are a few data formatting issues to note when using the LRMoE package, which are summarized below. Throughout this page, we assume there are N = 100
insurance policyholders, who have two coverages y_1
and y_2
(dimension of response D = 2
). Each policyholder has 5 covariates (number of covariates P = 5
).
Covariate
X
: The dimension ofX
should be 100 by 5 (= N * P
), which is relatively straightforward. If there are factor covariates (e.g. indicators for urban and rural region), the intercept term should be manually added, and the user needs to manually augment a column of zero-one indicator.Response
Y
(Exact Case): If bothy_1
andy_2
are observed exactly (i.e. the incurred losses are not censored or truncated), then we can setexact_Y = true
in the initialization and fitting function. In this case, the dimension ofY
should be 100 by 2 (= N * D
): the first column is fory_1
and the second fory_2
.Response
Y
(Not Exact Case): If eithery_1
ory_2
is not observed exactly, thenexact_Y = false
and the dimension ofY
should be 100 by 8 (= N * 4D
): the first four columns are fory_1
and the remaining fory_2
. For each block of 4 columns, they should be structured as (tl
,yl
,yu
,tu
), corresponding to the lower bound of truncation, lower bound of censoring, upper bound of censoring and upper bound of truncation, respectively. Some typical cases are listed below:- Both
y_1
andy_2
are observed exactly: Assumey_1 = 2.0
andy_2 = 3.0
, then the first row ofY
should be[0.0 2.0 2.0 Inf 0.0 3.0 3.0 Inf]
. Alternatively, we can setexact_Y = true
(see the previous case), and also set the first row ofY
as[2.0 3.0]
. y_1
is observed exactly,y_2
is left-truncated at1.0
but observed exactly (e.g. an insurance deductible): Assumey_1 = 2.0
andy_2 = 3.0
, then the first row ofY
should be[0.0 2.0 2.0 Inf 1.0 3.0 3.0 Inf]
.y_1
is observed exactly,y_2
is right-censored at2.0
(e.g. a payment limit): Assumey_1 = 2.0
andy_2 = 3.0
, then the first row ofY
should be[0.0 2.0 2.0 Inf 0.0 2.0 Inf Inf]
.y_1
is observed exactly,y_2
is only observed within a range: Assumey_1 = 2.0
andy_2 = 3.0
, but we only observe2.5 < y_2 < 3.5
, then the first row ofY
should be[0.0 2.0 2.0 Inf 0.0 2.5 3.5 Inf]
.
- Both
Logit Regression Coefficients
α
: Assume we would like to fit an LRMoE with three latent classes (g = 3
), then the dimension ofα
should be 5 by 3 (= N * g
). For example, a noninformative guess can be initialized asα = fill(0.0, 5, 3)
.Component Distribution
comp_dist
: Assume we would like to fit an LRMoE with three lateht classes (g = 3
), then the dimension ofcomp_dist
should be 2 by 3 (= D * g
). For example, if we assumey_1
is a mixture of lognormals andy_2
is a mixture of gammas, thencomp_dist = [LogNormalExpert(1.0, 2.0) LogNormalExpert(1.5, 2.5) LogNormalExpert(2.0, 3.0); GammaExpert(1.0, 2.0) GammaExpert(1.5, 2.5) GammaExpert(2.0, 3.0)]
. Note that the columns ofcomp_dist
should be distinct, otherwise the model is not identifiable.Penalty on Logit Regression Coefficients
pen_α
: It should be a single number (i.e. a uniform penalty imposed on all coefficients inα
). For example, whenpen_α = 2.0
, then the penaltysum( (α ./ 2.0).^2 )
is subtracted from the loglikelihood as a penalty. In other words, we would not like the magnitude ofα
to be too large.Penalty on Parameters of Expert Functions
pen_params
: Using thecomp_dist
mentioned above,pen_params
should be a matrix of size 2 by 3 (= D * g
), where each entry is a vector of real numbers penalizing the parameters of expert functions. Usually, the user can leave this argument as default. For more details, the user is referred to the package source code.