If your interest is in comparing the proportion of requests that attract offers for the two vehicle types, then it's a two-sample proportions test.
That could indeed be done as a chi-square test.
However, it sounds like your alternative is one-tailed (you're interested in detecting bias in a particular direction). If that anticipated direction of bias wasn't chosen by looking at the data you use in the test, and you don't wish to pick up bias in the opposite direction, you might do a one-tailed test instead.
Otherwise the chi-square test does the same job and you could use it since you have some familiarity with it.
What is a typical number of requests for each type and what fraction of them overall result in offers?
The two-sample proportions test statistic can be found on this page (though it calls it the "Two-proportion z-test, pooled for $H_0\colon p_1=p_2$").
For that, your requests are the $n$'s and the offers are the $x$'s.
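As a sketch of the arithmetic (the counts here are the ones that appear later in this answer, so treat them purely as an illustration):

x <- c(54, 328); n <- c(76, 558)          # offers and requests for the two vehicle types
p.hat  <- x / n                           # sample proportions
p.pool <- sum(x) / sum(n)                 # pooled proportion under H0: p1 = p2
z <- (p.hat[1] - p.hat[2]) / sqrt(p.pool * (1 - p.pool) * (1 / n[1] + 1 / n[2]))
z^2                                       # about 4.21; this is also the chi-square statistic
2 * pnorm(abs(z), lower.tail = FALSE)     # two-sided p-value, about 0.040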
The corresponding chi-square test is discussed here.
There, the O's are the observed counts: the offers and the requests that got no offers, for each vehicle type. The E's are computed from the O's using the formula at the link.
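In R, for instance, you could form the E's and the statistic directly (again using the counts that appear later in this answer, as an illustration):

O <- matrix(c(54, 22, 328, 230), nrow = 2)   # rows: offer / no offer; columns: vehicle type
E <- outer(rowSums(O), colSums(O)) / sum(O)  # each E is row total * column total / grand total
sum((O - E)^2 / E)                            # Pearson X^2, about 4.21 (1 d.f. here)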
An explanation of p-values is here; the first sentence defines them.
If you need additional explanation please clarify what you need.
Major edit in response to questions in comments:
but it's good to know that that is the proper way to do it.
Whoah. I didn't say that.
And your calculations up there don't include a continuity correction.
With CC:
> chisq.test(matrix(c(54,22,328,230),nr=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(54, 22, 328, 230), nr = 2)
X-squared = 3.709, df = 1, p-value = 0.05412
Without CC:
> chisq.test(matrix(c(54,22,328,230),nr=2),correct=FALSE)
Pearson's Chi-squared test
data: matrix(c(54, 22, 328, 230), nr = 2)
X-squared = 4.2058, df = 1, p-value = 0.04029
Could you possibly point me in the direction of a formula to calculate the p-value for 1 degree of freedom?
A chi-square(1) is the square of a standard normal. You can evaluate its tail probabilities by taking the square root of the statistic and doubling the upper-tail area for a standard normal. So if you don't have a chi-square function or table, you can use normal tables:
> 2*pnorm(sqrt(4.2058),lower.tail=FALSE)
[1] 0.04028597
and get the same result. But that only works for 1 d.f.
I haven't been able to find any myself - all the sites either have a Java application or a pre-defined table
There's a reason for this. There's no exact closed-form function. You can approximate it in various ways (e.g. by series expansions or by using ratios of polynomials or by numerical integration).
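As one illustration (a sketch, not the only possible approach): the Abramowitz & Stegun polynomial approximation to the normal upper tail, doubled as above to give a chi-square(1) p-value, is accurate to better than $10^{-7}$ and is easy to type into a spreadsheet or code:

p.chisq1 <- function(x2) {
  z   <- sqrt(x2)
  t   <- 1 / (1 + 0.2316419 * z)
  phi <- exp(-z^2 / 2) / sqrt(2 * pi)      # standard normal density
  # Abramowitz & Stegun 26.2.17 approximation to P(Z > z)
  tail <- phi * t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
            t * (-1.821255978 + t * 1.330274429))))
  2 * tail                                 # double the upper tail for chi-square(1)
}
p.chisq1(4.2058)                           # about 0.0403, agreeing with the result above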
and I do not know if Excel's in-built CHITEST() function applies Yates' continuity correction.
It doesn't do continuity correction by default. I'm not sure why you're quite so focused on it though.
About the result: You are saying that a p-value of 0.05 indicates that the situation occurs 1/20 by chance.
Not so. Please read carefully the first sentence of the p-value article I pointed to.
You need to pursue this until you comprehend why the answer to your next question:
Does that mean 19/20 are biased?
... is 'obviously not'.
if I want a precise and true p-value I need to use a formula
Nothing about the continuity correction makes the p-value either 'precise' or 'true', and in any case you don't need 'a formula' to calculate the p-value after applying the continuity correction.
I've tried plotting the p-value against X^2 of 1 df and creating an exponential trendline. The fit has an R^2-value of 0.993 - do you think I can use the function of the trendline as my formula for p-value?
Not in general, no. Not even if it weren't off by what looks like a factor of 10.
Major edit 2, in response to further comments:
The p-value was never going to be the essential factor in the report -
This sentence makes me happy. Significance tests are useful in the right context, but for some reason they get used much more often than I'd ever think is reasonable.
its purpose was to give the user a quick idea of the nature of the transactions that have taken place with the specific supplier.
That might perhaps be better served by a measure of the effect size (such as a difference in proportions, with an accompanying confidence interval to give some sense of whether the difference is explainable by chance).
I need to consider the "balance between the costs of the two types of error, and between the probabilities of the two types of error at your sample size." Could you explain what this means?
This was in relation to choosing a significance level; if p-values aren't particularly important to you it may not matter.
These are the two error types I was referring to:
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
In your case, the probability of the second type of error is a function of the difference in proportions; often a particular effect size is chosen and the desired power at that effect size is used to pin the design down, usually by choosing the sample size for the study, but sometimes by moving the Type I error rate (significance level) instead.
Since you're trying to choose a Type I error rate with a sample size that's already fixed, what you have to trade off against is the power at some given effect size.
I lack an intuitive understanding of how the probabilities relate to the real world.
There is approximately a 1/25 chance (no Yates') of getting the REQUESTS/OFFERS combination in this case
Only if the null hypothesis is true. (Which it won't be.)
- why should this be a weak indication of bias? Why not 1/100?
I wish more people would ask such a question.
What is 'weak' or 'strong' depends on the context; the tradeoff I mentioned is part of that calculation. Let's say it's what is more typically regarded as weak evidence -- since a 1/20 event is hardly astonishing. Your own particular needs and circumstances should always trump conventional senses of what's weak or strong.
Incidentally, I think you should feel no embarrassment over not getting the notion of a p-value first go. It is a somewhat subtle, even counterintuitive idea and is one of the most misunderstood concepts in the whole of statistics. Indeed, I am often asked by people "Is this text any good?" - often one I've never seen.
One of the first things I do in evaluating a text is to check whether it screws up on explaining what a p-value is. That easily eliminates a quarter of textbooks (often with titles like "Introductory statistics for ________") on the spot. If a large fraction of the people teaching in some department or other that has an intro stats class still get it wrong after sometimes decades at it, you shouldn't feel too bad. In fact I was surprised when I first read the first sentence in wikipedia's p-value article to find it got it right. I expected to have to fix it.
I've also seen it wrong in videos for online courses. I've even seen it wrong in academic papers once or twice (fortunately actual statisticians don't get it wrong very often - it's usually someone in some other area doing it as a sideline).
Is it something that should be settled experimentally by analysing several cases we know are biased, and then using the average p-value of those as the significance level?
Not really - the typical size of bias in cases that are biased doesn't tell you that a bias one tenth as large isn't important. How much bias (as measured by a difference in probability, an odds ratio, or whatever) would matter? That's the sort of effect size you should focus on.
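For illustration, with the counts used earlier in this answer, two common ways of measuring the size of the effect would be:

x <- c(54, 328); n <- c(76, 558)
p <- x / n
p[1] - p[2]                                       # difference in proportions, about 0.123
(x[1] / (n[1] - x[1])) / (x[2] / (n[2] - x[2]))   # odds ratio, about 1.72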
Could one say that this new significance level had been adjusted to the specific need of the analysis: To pinpoint bias?
I don't know that I properly follow here, but if you decide on your type I and type II error rates together by comparing the relative loss of making the two types of error at no difference (type I) and at the minimum relevant effect size (type II), you can say it was chosen with regard to detecting bias of a size that's of practical importance.
Major edit 3, in response to additional questions:
Your suggestion of measuring the effect size sounds interesting. This is a screenshot of the report I am working on. There is a difference of 42.4% between OWN and T/C, with regards to the percentage of offers received relative to vehicle type. If I've understood you correctly, this is the effect size. This was actually what I was considering doing prior to the chi-squared test, but I wasn't sure how to deal with the fact that I must define an arbitrary threshold ("If the effect size is bigger than 20%, then the supplier is likely biased") -- as well as how to include sample size in my considerations.
You can perhaps get the best of both worlds if you compute a confidence interval for that difference, as I already mentioned.
If the CI includes zero, it's equivalent to saying that 'it could be explained by random variation' and thus gives the same kind of information as a hypothesis test. If you use the right choice of interval, the interval will even correspond exactly to a chi-square, like so --
Proportions interval and test:
> prop.test(x=c(54,328),n=c(76,558),alt="two.sided",correct=FALSE)
2-sample test for equality of proportions without continuity
correction
data: c(54, 328) out of c(76, 558)
X-squared = 4.2058, df = 1, p-value = 0.04029
alternative hypothesis: two.sided
95 percent confidence interval:
0.01287586 0.23254953
sample estimates:
prop 1 prop 2
0.7105263 0.5878136
Corresponding chi-square test:
> chisq.test(matrix(c(54,22,328,230),nr=2),correct=FALSE)
Pearson's Chi-squared test
data: matrix(c(54, 22, 328, 230), nr = 2)
X-squared = 4.2058, df = 1, p-value = 0.04029
(See how the p-value is the same? It is the same test; the $100(1-\alpha)\%$ confidence interval for the difference excludes zero exactly when the level-$\alpha$ test is significant.)
In any case the other common two-sample proportions intervals will come close to the same as the chi-square anyway (in the sense that the effective p-values will generally be pretty close).
The chi-squared test seems to package all of this neatly into a single quantity, so I would prefer to make the significance level useful for my purposes while also including the effect size in the report.
This is fine - but you can also do similar "neat packages" in other ways.
If I know the effect size I am testing for, using power analysis I can calculate the minimum sample size required for a good chi-square test. If my sample size exceeds this, I can then move on to finding significance level. Is this correct?
Forget the sample size calculation if you already have a sample whose size you can't change; what you do instead is work out the power associated with the sort of effect size you'd regard as important to pick up. That tells you not whether you can do the test (you can!) but how good the test will be at picking up interesting deviations.
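As a rough sketch of that calculation (normal-approximation power for the two-sided two-proportion test with the group sizes held fixed; the p1 and p2 values here are only illustrative stand-ins for an effect size you'd regard as important):

power2prop <- function(p1, p2, n1, n2, alpha = 0.05) {
  pbar <- (n1 * p1 + n2 * p2) / (n1 + n2)                # pooled proportion under H0
  se0  <- sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))    # SE the test uses under H0
  se1  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE under the alternative
  pnorm((abs(p1 - p2) - qnorm(1 - alpha / 2) * se0) / se1)
}
power2prop(p1 = 0.71, p2 = 0.59, n1 = 76, n2 = 558)       # roughly 0.5

A result near 0.5 says that, at these sample sizes, an effect of roughly the observed magnitude would only show up as significant about half the time.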