Cohen's kappa was originally intended to be applied to human classifiers. You can find this described in his article, "A coefficient of agreement for nominal scales", where he motivates the statistic with the following example:
... two psychiatrists independently making a schizophrenic-nonschizophrenic distinction on outpatient clinic admissions might report 82 percent agreement, which sounds pretty good. But is it? Assume for a moment that instead of carefully interviewing every admission, each psychiatrist classifies 10 percent of the admissions as schizophrenic, but does so blindly, i.e., completely at random. Then the expectation is that they will jointly "diagnose" .10 × .10 = .01 of the sample as schizophrenic and .90 × .90 = .81 as nonschizophrenic, a total of .82 "agreement," obviously a purely chance result. This is no more impressive than an ESP demonstration of correctly calling the results of coin tosses blindfolded 50 percent of the time!

The example is a bit extreme, but the principle it illustrates is unexceptionable: the proportion or percent of observed agreement between two judges assigning cases to a set of k mutually exclusive, exhaustive categories inevitably contains an expected proportion that is entirely attributable to chance.
Thus was kappa born. It is simply the proportion of agreement corrected for chance.
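Concretely, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. Here is a minimal sketch that plugs in the numbers from Cohen's example above (the function name cohens_kappa is my own, not from the paper):

```python
def cohens_kappa(p_observed, p_expected):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    return (p_observed - p_expected) / (1.0 - p_expected)

# Each psychiatrist calls 10% of cases "schizophrenic" at random, so the
# chance agreement is .10*.10 + .90*.90 = .82 -- exactly the 82% reported.
p_e = 0.10 * 0.10 + 0.90 * 0.90
p_o = 0.82

print(cohens_kappa(p_o, p_e))  # ~0.0: no agreement beyond chance
```

So an impressive-sounding 82 percent raw agreement collapses to a kappa of zero once the chance component is removed.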
That said, it has been used by others to evaluate statistical classifiers. See, for example, the papers "Chance-corrected Classification for Use in Discriminant Analysis: Ecological Applications" and "Comparison of classification accuracy using Cohen's Weighted Kappa".
Or you can follow the post "Cohen's kappa in plain English" as well.
To your question, "Are there any practical (or theoretical) concerns related to that when using the Kappa statistic in performance evaluation of a classification model?": to my knowledge, no research has identified concerns with applying an evaluation tool intended for human raters to statistical classifiers. However, it is widely used for that purpose, so you can have some confidence that using this tool for non-human classifiers is not a problem.
If you have two classifiers, you can use the statistic to evaluate their agreement with each other, but you should not use it to compare a classifier against the "true" values, since (it is assumed) the "true" values are factual rather than estimates.
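As a rough, self-contained sketch of that use, scikit-learn provides cohen_kappa_score; the synthetic data and the two example models below are placeholders, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

# Placeholder data and models, just to produce two sets of predictions.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred_a = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
pred_b = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# Kappa between the two classifiers' predictions: agreement corrected for chance.
print(cohen_kappa_score(pred_a, pred_b))
```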
Hope this helps.