Since these are binomial data, I would analyze them as such, using appropriate (mixed-effects) logistic regression models. Your dataset will have a block id variable, a condition dummy variable (0 for condition 1, 1 for condition 2), and the number of successes and failures within each condition within each block. For example, for the data you posted (using R):
dat <- data.frame(block     = c(1, 1, 2, 2, 3, 3, 40, 40),
                  condition = c(0, 1, 0, 1, 0, 1, 0, 1),
                  successes = c(3, 4, 0, 3, 9, 2, 1, 6),
                  failures  = c(8, 2, 12, 2, 0, 5, 3, 0))
dat
The data frame looks like this:
  block condition successes failures
1     1         0         3        8
2     1         1         4        2
3     2         0         0       12
4     2         1         3        2
5     3         0         9        0
6     3         1         2        5
7    40         0         1        3
8    40         1         6        0
Then we can fit a logistic regression model with a block factor and the condition dummy:
res1 <- glm(cbind(successes, failures) ~ factor(block) + condition, data=dat, family=binomial)
summary(res1)
This yields (cutting out some stuff here):
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.7006     0.5542  -1.264    0.206
factor(block)2   -1.1749     0.8209  -1.431    0.152
factor(block)3    1.1215     0.7479   1.500    0.134
factor(block)40   1.0269     0.8722   1.177    0.239
condition         0.9361     0.6029   1.552    0.121
The idea here is that the overall success rate may differ across blocks, so we allow for such differences by including the block factor. The coefficient for the condition dummy is the estimated log odds ratio for condition 2 versus 1 within blocks. If the success rate is unaffected by condition, then this implies a log odds ratio of 0, so the p-value given here is what you are looking for (i.e., it tests the null hypothesis that the true log odds ratio is 0).
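If you also want the effect on the odds ratio scale, you can exponentiate the coefficient and its confidence interval. A minimal sketch using the res1 object from above (confint() gives profile likelihood intervals here; confint.default() would give Wald-type ones):
exp(coef(res1)["condition"])            # estimated odds ratio for condition 2 vs. condition 1
exp(confint(res1, parm = "condition"))  # 95% profile likelihood CI on the odds ratio scale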
The model above assumes that the true log odds ratio is constant across blocks. That may not be the case. We can model such differences by adding a random effect for the condition dummy:
library(lme4)
res2 <- glmer(cbind(successes, failures) ~ factor(block) + condition + (condition - 1 | block), data=dat, family=binomial)
summary(res2)
This yields (again cutting out some things):
Random effects:
 Groups Name      Variance Std.Dev.
 block  condition 15.38    3.921
Number of obs: 8, groups:  block, 4

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.9907     0.6722  -1.474   0.1405
factor(block)2   -3.3084     2.4908  -1.328   0.1841
factor(block)3    4.0845     1.8769   2.176   0.0295 *
factor(block)40   0.2032     1.2622   0.161   0.8721
condition         2.1526     2.4080   0.894   0.3713
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The coefficient (and p-value) for condition now tests whether the average log odds ratio across blocks is significantly different from 0 (note that with only 4 blocks of data, this is stretching things a bit, but with 40 blocks, you should be fine).
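If you want the corresponding average odds ratio with a confidence interval, something along these lines should work (a quick Wald-type interval; confint() with method="profile" or method="boot" is also possible, but slower):
exp(fixef(res2)["condition"])                            # estimated average odds ratio across blocks
exp(confint(res2, parm = "condition", method = "Wald"))  # Wald-type 95% CI on the odds ratio scale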
You can test if the second model is actually a significant improvement over the first with:
anova(res2, res1)
This yields:
     Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
res1  5 49.418 49.815 -19.709   39.418
res2  6 38.118 38.595 -13.059   26.118  13.3      1  0.0002655 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In essence, this is testing the null hypothesis that the variance of the random condition effect is 0. Again, with so little data, this is just an illustration, but the results above suggest that there is significant variability in the condition effect. Assuming that you find something similar in the full dataset, you may need to qualify your conclusions a bit more (i.e., you can still say something about whether condition affects the outcome on average, but you should note that the size of the effect appears to differ across blocks).
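If you do find such variability, it can also be informative to look at the block-specific condition effects, i.e., the fixed effect plus the predicted random effect for each block. A quick sketch with the model from above:
coef(res2)$block   # per-block coefficients; the 'condition' column gives the block-specific log odds ratios
ranef(res2)$block  # just the predicted random deviations in the condition effect per block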