Data Formatting
There are a few data formatting issues to note when using the LRMoE package, which are summarized below. Throughout this page, we assume there are N = 100 insurance policyholders, who have two coverages y_1 and y_2 (dimension of response D = 2). Each policyholder has 5 covariates (number of covariates P = 5).
Covariate
X: The dimension of X should be 100 by 5 (= N * P), which is relatively straightforward. Note that the intercept term should be manually added as a column of ones. If there are factor covariates (e.g. indicators for urban and rural region), the user needs to manually augment X with columns of zero-one indicators.
Response
Y (Exact Case): If both y_1 and y_2 are observed exactly (i.e. the incurred losses are not censored or truncated), then we can set exact_Y = true in the initialization and fitting function. In this case, the dimension of Y should be 100 by 2 (= N * D): the first column is for y_1 and the second for y_2.
Response
Y (Not Exact Case): If either y_1 or y_2 is not observed exactly, then exact_Y = false and the dimension of Y should be 100 by 8 (= N * 4D): the first four columns are for y_1 and the remaining four for y_2. Each block of 4 columns should be structured as (tl, yl, yu, tu), corresponding to the lower bound of truncation, lower bound of censoring, upper bound of censoring and upper bound of truncation, respectively. Some typical cases are listed below:
- Both y_1 and y_2 are observed exactly: Assume y_1 = 2.0 and y_2 = 3.0, then the first row of Y should be [0.0 2.0 2.0 Inf 0.0 3.0 3.0 Inf]. Alternatively, we can set exact_Y = true (see the previous case), and also set the first row of Y as [2.0 3.0].
- y_1 is observed exactly, y_2 is left-truncated at 1.0 but observed exactly (e.g. an insurance deductible): Assume y_1 = 2.0 and y_2 = 3.0, then the first row of Y should be [0.0 2.0 2.0 Inf 1.0 3.0 3.0 Inf].
- y_1 is observed exactly, y_2 is right-censored at 2.0 (e.g. a payment limit): Assume y_1 = 2.0 and y_2 = 3.0, then the first row of Y should be [0.0 2.0 2.0 Inf 0.0 2.0 Inf Inf].
- y_1 is observed exactly, y_2 is only observed within a range: Assume y_1 = 2.0 and y_2 = 3.0, but we only observe 2.5 < y_2 < 3.5, then the first row of Y should be [0.0 2.0 2.0 Inf 0.0 2.5 3.5 Inf].
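The covariate and response layouts described above can be sketched in plain Julia. This is only an illustration under stated assumptions: obs_block is a hypothetical helper (it is not part of LRMoE), the covariate values are made up, and the toy data use 4 policyholders and 3 covariate columns rather than the N = 100 and P = 5 of the running example.

```julia
# Toy covariate matrix: intercept column of ones, a zero-one indicator
# (1.0 = urban, 0.0 = rural) and one made-up numeric covariate.
urban = [1.0, 0.0, 1.0, 0.0]
age   = [35.0, 52.0, 41.0, 28.0]
X = hcat(ones(4), urban, age)          # 4 by 3

# Hypothetical helper (not an LRMoE function): one (tl, yl, yu, tu) block.
# The defaults describe an exactly observed, untruncated positive loss.
function obs_block(; yl, yu = yl, tl = 0.0, tu = Inf)
    tl <= yl <= yu <= tu || throw(ArgumentError("need tl <= yl <= yu <= tu"))
    return [tl yl yu tu]               # 1 by 4 row
end

y1 = obs_block(yl = 2.0)               # y_1 = 2.0, observed exactly

# One row per case from the list above (block for y_1, then block for y_2).
row_exact     = hcat(y1, obs_block(yl = 3.0))            # y_2 exact
row_truncated = hcat(y1, obs_block(yl = 3.0, tl = 1.0))  # left-truncated at 1.0
row_censored  = hcat(y1, obs_block(yl = 2.0, yu = Inf))  # right-censored at 2.0
row_interval  = hcat(y1, obs_block(yl = 2.5, yu = 3.5))  # 2.5 < y_2 < 3.5

Y = vcat(row_exact, row_truncated, row_censored, row_interval)  # 4 by 8 (= N * 4D)
```

Building each block through one helper keeps the ordering (tl, yl, yu, tu) in a single place, which makes it harder to swap a censoring bound with a truncation bound by accident.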
Logit Regression Coefficients
α: Assume we would like to fit an LRMoE with three latent classes (g = 3), then the dimension of α should be 5 by 3 (= P * g). For example, a noninformative guess can be initialized as α = fill(0.0, 5, 3).
Component Distribution
comp_dist: Assume we would like to fit an LRMoE with three latent classes (g = 3), then the dimension of comp_dist should be 2 by 3 (= D * g). For example, if we assume y_1 is a mixture of lognormals and y_2 is a mixture of gammas, then comp_dist = [LogNormalExpert(1.0, 2.0) LogNormalExpert(1.5, 2.5) LogNormalExpert(2.0, 3.0); GammaExpert(1.0, 2.0) GammaExpert(1.5, 2.5) GammaExpert(2.0, 3.0)]. Note that the columns of comp_dist should be distinct, otherwise the model is not identifiable.
Penalty on Logit Regression Coefficients
pen_α: It should be a single number (i.e. a uniform penalty imposed on all coefficients in α). For example, when pen_α = 2.0, the term sum( (α ./ 2.0).^2 ) is subtracted from the loglikelihood as a penalty. In other words, we would not like the magnitude of α to be too large.
Penalty on Parameters of Expert Functions
pen_params: Using the comp_dist mentioned above, pen_params should be a matrix of size 2 by 3 (= D * g), where each entry is a vector of real numbers penalizing the parameters of expert functions. Usually, the user can leave this argument as default. For more details, the user is referred to the package source code.
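As a rough illustration of the penalty on α, the formula quoted above can be evaluated directly in plain Julia. This sketch makes no LRMoE calls: the nonzero coefficients are made up, penalized_loglik is a hypothetical helper, and the package may combine this term with other penalties (such as those in pen_params) internally.

```julia
P, g = 5, 3
α = fill(0.0, P, g)      # noninformative starting value, as in the text
α[2, 1] = 1.0            # made-up nonzero coefficients, for illustration only
α[3, 2] = -2.0

pen_α = 2.0
penalty = sum((α ./ pen_α) .^ 2)   # (1/2)^2 + (-2/2)^2 = 1.25

# Hypothetical helper: subtract the α penalty from an unpenalized
# loglikelihood value ll, following the formula in the text.
penalized_loglik(ll, α, pen_α) = ll - sum((α ./ pen_α) .^ 2)

penalized_loglik(-10.0, α, pen_α)  # -11.25
```

With α = fill(0.0, 5, 3) the penalty is zero; a larger pen_α shrinks the penalty for the same coefficients, so it constrains α less tightly.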