
I know this question has already been asked in other posts with quite comprehensive answers; I am referring in particular to these two explanations (thread1, thread2). However, I still have doubts, both because the literature related to my analysis adopts different approaches and because I have little experience with models for count data. I am analysing a dataset of a company's auto insurance policies. The goal is to predict, through cross-validation, the number of claims, with exposure as an offset, from which to derive the claim frequency that will later be used to calculate fair premiums (claim frequency multiplied by the average damage cost per claim).
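To make the setup concrete, here is a minimal sketch of what I mean by modelling the claim counts with exposure as an offset (Python/statsmodels; the DataFrame `policies` and the column names `n_claims`, `exposure`, `x1`–`x5` are just placeholders for my data):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Poisson GLM for the claim counts, with log(exposure) as offset, so the
# linear predictor models the claim frequency per unit of exposure.
poisson_fit = smf.glm(
    "n_claims ~ C(x1) + C(x2) + C(x3) + C(x4) + C(x5)",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()

# Predicted claim frequency per unit of exposure (offset = 0, i.e. exposure = 1);
# the fair premium would then be this frequency times the average cost per claim.
policies["pred_freq"] = poisson_fit.predict(
    policies, offset=np.zeros(len(policies))
)
```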

The premiums, which will also be predicted and compared with the observed ones, should however refer not to individual policies but to tariff classes chosen to group the insured according to their degree of risk.

For this reason, the independent variables used for the number of claims are all categorical. Suppose I have 5 independent variables structured as follows:

x1: 5 levels; x2: 6 levels; x3: 6 levels; x4: 10 levels; x5: 7 levels.

In theory I would therefore have 5 × 6 × 6 × 10 × 7 = 12,600 classes.

Now my doubt is whether to estimate the models for the number of claims using the individual policies as observations or using the data grouped by class (so that my response variable y would be the sum of the claims of the policies belonging to each class).

In the first case (individual policies) I would have more than 400,000 observations; y ranges from 0 to 4, with 96% zeros and very similar mean and variance (0.039 and 0.042 respectively), for which I would choose a zero-inflated Poisson.
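For this individual-policy option, this is roughly the zero-inflated Poisson I have in mind (statsmodels' ZeroInflatedPoisson with a constant-only inflation part; all names are placeholders, as above):

```python
import numpy as np
import patsy
import statsmodels.api as sm

# Design matrices for the count part of the model
y, X = patsy.dmatrices(
    "n_claims ~ C(x1) + C(x2) + C(x3) + C(x4) + C(x5)",
    data=policies, return_type="dataframe",
)

# Zero-inflated Poisson with log(exposure) as offset on the count part;
# the inflation part is a constant only (exog_infl = column of ones).
zip_fit = sm.ZeroInflatedPoisson(
    y, X,
    exog_infl=np.ones((len(y), 1)),
    offset=np.log(policies["exposure"]),
    inflation="logit",
).fit(method="bfgs", maxiter=500)
```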

In the second case (aggregated by class) I would have about 6,700 observations (the classes actually observed in the sample, not all 12,600 possible combinations); y ranges from 0 to 240, with 54% zeros and very different mean and variance (2.65 and 90.50 respectively), so I would choose between a negative binomial and a quasi-Poisson model, since the excess of zeros is no longer an issue. However, I fear that with this sample I would face loss of information, ecological fallacy, and strong variability between classes (some classes collect more than 6,000 policies, others only one). Still, as I said, some studies use aggregated data to estimate and predict the claim frequency for premium calculation.
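For comparison, this is the aggregated alternative as I picture it: sum claims and exposure within each tariff class and fit, for example, a negative binomial (NB2) with the summed exposure as offset (again only a sketch with placeholder names):

```python
import numpy as np
import statsmodels.formula.api as smf

# Collapse the individual policies to one row per tariff class
classes = (
    policies
    .groupby(["x1", "x2", "x3", "x4", "x5"], as_index=False)
    .agg(n_claims=("n_claims", "sum"), exposure=("exposure", "sum"))
)

# Negative binomial (NB2) on the class totals, offset = log of summed exposure
nb_fit = smf.negativebinomial(
    "n_claims ~ C(x1) + C(x2) + C(x3) + C(x4) + C(x5)",
    data=classes,
    offset=np.log(classes["exposure"]),
).fit(maxiter=200)
```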

Which data do you think it is more correct to use? I could also fit the model on the individual policies and then aggregate the predicted values to calculate the frequencies (and premiums) of the classes. Would that be correct?
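To be explicit about this last option (predict per policy, then aggregate), I would do something like the following, continuing the Poisson sketch above (placeholder names again):

```python
import numpy as np

# Predicted number of claims per policy, using each policy's own exposure
policies["pred_claims"] = poisson_fit.predict(
    policies, offset=np.log(policies["exposure"].to_numpy())
)

# Class-level frequency = sum of predicted claims / sum of exposure
class_freq = (
    policies
    .groupby(["x1", "x2", "x3", "x4", "x5"], as_index=False)
    .agg(pred_claims=("pred_claims", "sum"), exposure=("exposure", "sum"))
    .assign(pred_frequency=lambda d: d["pred_claims"] / d["exposure"])
)
```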

If the aggregated data are suitable for modeling, when should I split the dataset for cross-validation: before or after aggregation?
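These are the two splitting schemes I am weighing, sketched with scikit-learn's KFold purely for illustration (the model fitting is omitted; `classes` is the aggregated table from the earlier sketch):

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
group_cols = ["x1", "x2", "x3", "x4", "x5"]

# Option A: split the individual policies first, then aggregate within each fold
for train_idx, test_idx in kf.split(policies):
    train_classes = (
        policies.iloc[train_idx]
        .groupby(group_cols, as_index=False)
        .agg(n_claims=("n_claims", "sum"), exposure=("exposure", "sum"))
    )
    # ... fit on train_classes, build the test aggregation the same way, evaluate

# Option B: aggregate once, then split the classes (each class in one fold only)
for train_idx, test_idx in kf.split(classes):
    train_classes = classes.iloc[train_idx]
    test_classes = classes.iloc[test_idx]
    # ... fit on train_classes, evaluate on test_classes
```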
