I am testing the accuracy of a machine learning approach that counts cars in images. I have both a dataset of predicted counts and a "real" (ground-truth) dataset generated by a human annotator. For example, this is what my data looks like:
image   real_count   predicted_count
A       6            6
B       5            6
C       0            1
D       7            6
E       6            6
F       9            11
G       1            1
I am trying to assess how well the predicted counts agree with the real counts. Is it appropriate to use a confusion matrix (treating each distinct count as a class) and the associated agreement measures such as Cohen's kappa to assess the accuracy of the predictions? Or is there a more suitable measure of accuracy for this type of frequency data?
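For concreteness, here is a minimal sketch of the two kinds of comparison I have in mind (assuming Python with pandas and scikit-learn; the column names match my table above): a categorical view via a confusion matrix and kappa, and a numeric view via a simple error metric.

    import pandas as pd
    from sklearn.metrics import (cohen_kappa_score, confusion_matrix,
                                 mean_absolute_error)

    # The example data from the table above.
    df = pd.DataFrame({
        "image": list("ABCDEFG"),
        "real_count": [6, 5, 0, 7, 6, 9, 1],
        "predicted_count": [6, 6, 1, 6, 6, 11, 1],
    })

    # Union of all observed counts, so both views share the same label set.
    labels = sorted(set(df["real_count"]) | set(df["predicted_count"]))

    # Categorical view: each distinct count is treated as its own class.
    print(confusion_matrix(df["real_count"], df["predicted_count"],
                           labels=labels))
    print("kappa:", cohen_kappa_score(df["real_count"], df["predicted_count"]))

    # Numeric view: average absolute difference between the two counts.
    print("MAE:", mean_absolute_error(df["real_count"], df["predicted_count"]))

I also noticed that cohen_kappa_score accepts weights="quadratic", which would treat the counts as ordered rather than as unrelated classes, but I am unsure whether that is the right direction either.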