What you are doing introduces multiple comparisons. In a confirmatory analysis, we pre-specify a primary analysis; any post-hoc or secondary analyses are fit not to confirm the primary findings but to understand limitations in the data. Without a description of the various analyses you have conducted, I can't make specific recommendations, but I suspect you are applying incorrect methods in several ways.
The intraclass correlation coefficient (ICC) measures the proportion of total variance attributable to between-cluster variation, and can be used to motivate a mixed-model approach for analyzing longitudinal or panel data. You seem to describe applying the ICC to individual analyses (such as regression or classification models), which doesn't make sense and is not in line with the measure's intended purpose. The concordance correlation coefficient (CCC) is a measure of calibration for statistical risk prediction models which, to be clear, involves a single risk prediction per participant and requires separate training/test datasets. The CCC can compare several risk models, but I emphasize: risk modeling in panel data is very nuanced, and I don't get the sense that that's what you're doing here.
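To make the distinction concrete, both quantities are simple to compute by hand. Here is an illustrative sketch with made-up data (not your analysis): a one-way ICC from ANOVA mean squares, and Lin's CCC between two continuous measurements.

```python
# Illustrative sketch with hypothetical data: one-way ICC from ANOVA
# mean squares, and Lin's concordance correlation coefficient (CCC).

def icc_oneway(groups):
    """ICC(1): share of total variance due to between-cluster variation.
    `groups` is a list of equal-sized lists, one per cluster."""
    k = len(groups[0])                       # measurements per cluster
    g = len(groups)                          # number of clusters
    all_vals = [x for grp in groups for x in grp]
    grand = sum(all_vals) / len(all_vals)
    means = [sum(grp) / k for grp in groups]
    msb = k * sum((m - grand) ** 2 for m in means) / (g - 1)
    msw = sum((x - m) ** 2
              for grp, m in zip(groups, means) for x in grp) / (g * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def lin_ccc(x, y):
    """Lin's CCC: agreement between two continuous measurements,
    penalizing both scatter and systematic shifts."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Two tight clusters far apart: most variance is between clusters.
print(icc_oneway([[1, 2], [5, 6]]))          # ~0.939
# A constant shift hurts agreement even though correlation is perfect.
print(lin_ccc([1, 2, 3, 4], [2, 3, 4, 5]))   # ~0.714
```

Note how the CCC drops below 1 for a pure shift: unlike Pearson correlation, it is a measure of agreement with the identity line, which is why it is used for calibration rather than association.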
"Agreement" or interater agreement is yet another type of finding which has to do with evaluating several replications of a test applied in a large population. While statistical testing does have some relationship with classification, it is not correct to apply measures of "agreement" in this setting because statistical tests have no source of variability outside of the data themselves. Examples of settings in which agreement would be applied would be in settings where multiple radiologists are classifying different screens as benign versus possible cancer.
So I can't really find a place to begin with your problem aside from reminding you of the correct approach to statistics:
1. Decide (a priori) on a single analytic approach that measures the outcome of interest in a way that is understandable to the general community.
2. Fit any subsequent models as a way of assessing the sensitivity of the first model, e.g., to loss to follow-up, unmeasured sources of variation, and/or autoregressive effects. Describe any limitations after reporting the main findings.