Conducting a $\chi^2$ test is totally appropriate. Your last sentence:
My initial thought was to do a chi squared test of homogeneity, but such a test would be punished as the number of images increases, whereas it seems intuitively that my chosen test should become more powerful the more images I use.
can be interpreted a few different ways. One way to do the test would be to have, say, 30 subjects conduct your experiment (with 10 images each, that gives 300 observations) and then simply bin every observation into either A or B, which would result in a table such as:
A B
-------
100 200
The expected count for each bin would be $n \cdot 0.5 = 300 \cdot 0.5 = 150$, so in this example one would reject the null hypothesis that each bin has equal probability. In R:
dat <- c(100, 200)  # observed counts for A and B
chisq.test(dat)     # goodness-of-fit test against equal bin probabilities
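To make the arithmetic behind that call explicit, here is a minimal sketch that computes the Pearson statistic by hand and compares it with `chisq.test`. The names `expected`, `stat` and `p_value` are my own, introduced only for illustration.

```r
# Observed counts from the table above: 100 in A, 200 in B
dat <- c(A = 100, B = 200)

# Under the null each bin has probability 0.5, so 150 are expected in each
expected <- sum(dat) * c(0.5, 0.5)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
stat <- sum((dat - expected)^2 / expected)

# Upper-tail p-value with (number of bins - 1) degrees of freedom
p_value <- pchisq(stat, df = length(dat) - 1, lower.tail = FALSE)

# The built-in test should report the same statistic
res <- chisq.test(dat)
print(c(by_hand = stat, built_in = unname(res$statistic)))
print(p_value)
```

With only two bins this is equivalent to testing whether the proportion of A responses is 0.5.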
This test becomes more powerful if you show the same number of subjects more images, since the total count $n$ grows. Another way to conduct the test would be to create a 10 by 2 table in which each row corresponds to a different image, e.g.:
Image A B
--------------
1 6 4
2 etc...
3
4
5
6
7
8
9
10
This approach has the advantage that you can examine the residuals from the table and see whether any particular image is more likely to be classified into the A or B category. Since the number of images shown is fixed, satisfying the conservative rule of thumb that the expected value for every cell should be at least 5 only requires running your experiment on at least ten people. I'm not sure whether this approach gains power to reject the null with more images, since you are adding rows to the table; it would take more investigation. (I would guess not for a very small number of people, but beyond, say, 20 people I would guess that more images do increase the power.) You may also consider Fisher's exact test on such a table (although I presume the p-value would need to be estimated via simulation).
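As a rough sketch of the image-by-category approach, the code below builds a 10 by 2 table from simulated responses and then looks at the expected counts, the standardized residuals, and a simulation-based Fisher test. The data are made up: the per-image probabilities in `p_A`, the 30 subjects, and the seed are all assumptions for illustration, not results from any real experiment.

```r
set.seed(1)
n_images   <- 10
n_subjects <- 30   # assumed number of people rating each image

# Hypothetical probability of an "A" response for each image (made up)
p_A <- c(rep(0.5, 8), 0.8, 0.2)

# Simulated number of A responses per image; the rest are B
a_counts <- rbinom(n_images, size = n_subjects, prob = p_A)
tab <- cbind(A = a_counts, B = n_subjects - a_counts)
rownames(tab) <- paste0("Image", seq_len(n_images))

# Chi-squared test on the 10 by 2 table
res <- chisq.test(tab)
print(res)

# Check the "expected count at least 5" rule of thumb
print(min(res$expected))

# Standardized residuals: large absolute values flag unusual images
print(round(res$stdres, 2))

# Fisher's exact test with a Monte Carlo p-value
print(fisher.test(tab, simulate.p.value = TRUE, B = 10000))
```

Note that a test on this table asks whether the A/B split is the same across images, which is a different question from whether the pooled split is 50/50.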
You can build the same type of "x by 2" table for people as well, in which case each row is a person. This has the same exploratory advantage: you can see whether any persons are more likely to classify images into the A or B category. This approach gains power as you show each person more images. Finally, you may consider a logistic regression model predicting the category, with individual or image random effects. This last suggestion requires the largest sample size, but gains power both when increasing the number of persons and when increasing the number of images.
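A minimal sketch of these last two suggestions, again on simulated data: the effect sizes, seed, and column names (`person`, `image`, `is_A`) are hypothetical. The mixed model uses `glmer` from the `lme4` package, which is one common choice rather than the only one, and the fit is skipped if that package is not installed.

```r
set.seed(2)
n_persons <- 30
n_images  <- 10

# One row per person-image pair (long format)
d <- expand.grid(person = factor(seq_len(n_persons)),
                 image  = factor(seq_len(n_images)))

# Hypothetical person and image effects on the logit scale (made up)
person_eff <- rnorm(n_persons, sd = 0.5)
image_eff  <- rnorm(n_images, sd = 0.5)
eta <- person_eff[as.integer(d$person)] + image_eff[as.integer(d$image)]

# Simulated response: 1 = classified as A, 0 = classified as B
d$is_A <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Person-by-category table: each row is a person
person_tab <- table(person   = d$person,
                    category = factor(d$is_A, levels = c(1, 0),
                                      labels = c("A", "B")))
print(chisq.test(person_tab))

# Logistic regression with crossed random intercepts for person and image;
# the fixed intercept is the overall log-odds of an A response
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::glmer(is_A ~ 1 + (1 | person) + (1 | image),
                     data = d, family = binomial)
  print(summary(fit))
}
```

In the mixed model, a fixed intercept near zero corresponds to no overall preference for A or B, while the random-effect variances describe how much persons and images differ.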