I'm currently planning a study to show that my software is comparable to humans at some task. The setup is that I take an input $I$ and have both a human and a computer modify $I$ in the same way, producing $M_H$ and $M_C$. Later, I will present both $M_C$ and $M_H$ to various study participants and ask them which is better according to some qualitative metric, or whether they're the same. I then repeat this for many possible inputs.
In the end, my data looks like this: I have a set of labels representing the various inputs and, for each label, a set of judgments recording whether a participant preferred the human output, preferred the computer output, or rated them the same. My goal is to show that, for a random input, the probability of a random judge preferring the computer output is at least $0.5$ (or some slightly lower number).
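To make the data layout concrete, here is a minimal sketch of what I have in mind. The labels and judgments are made up, and the function names and the `tie_weight` parameter are hypothetical. In particular, how "same" answers should count is an assumption I haven't settled. The sketch only computes the point estimate of the quantity I care about (the per-input preference rate, averaged over inputs); it is not a test.

```python
from statistics import mean

# Made-up toy data: input label -> judgments from the participants who saw it.
# Each judgment is "computer", "human", or "same".
judgments = {
    "input_a": ["computer", "human", "same", "computer"],
    "input_b": ["human", "human", "computer"],
    "input_c": ["computer", "same"],
}

def prefer_computer_rate(votes, tie_weight=0.0):
    """Share of votes favouring the computer output for one input.

    tie_weight is an assumption: 0.0 counts "same" as not preferring the
    computer, 0.5 splits ties evenly between the two outputs.
    """
    score = sum(
        1.0 if v == "computer" else tie_weight if v == "same" else 0.0
        for v in votes
    )
    return score / len(votes)

def estimate(judgments, tie_weight=0.0):
    """Average the per-input rates, so every input counts equally
    regardless of how many judges happened to see it."""
    return mean(prefer_computer_rate(v, tie_weight) for v in judgments.values())

p_hat = estimate(judgments)
p_hat_split_ties = estimate(judgments, tie_weight=0.5)
```

Averaging per input rather than pooling all judgments matches the "random input, then random judge" phrasing of the goal, but the two would differ when inputs receive unequal numbers of judgments.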
How do I analyze this? The closest thing I found in my search was material on "ipsative" data, but I didn't find anything on how to analyze it, and those tests all seem to be aimed at measuring the human subject rather than the two choices in the question.