5

I am having trouble designing an experiment. I will give a hypothetical example that shares the main features of my actual problem.

Suppose there are:

  1. $M$ meadows, indexed by $m={1,...,M}$, where $M \approx 1000$ if that simplifies things.
  2. $F$ types of flowers, indexed by $f={1,...,F}$, where $F \approx 150$.
  3. no meadow has all the species of flowers
  4. every meadows has at least one type of flower growing on it
  5. $B_{mf}>0$ is a known and stable, but varying, value for each type of flower in a particular meadow.

I want to select $V$ varieties (where $V \in [20,30]$) and $L$ meadows such that $\sum B_{mf}$ is maximized, subject to the following constraints:

  1. all $V$ varieties must grow in every one of the $L$ selected meadows
  2. I wind up with $V\cdot L = P\approx 5000$ flower-meadow pairs to randomize for an experiment
  3. Randomization has to be at the meadow level

The $B_{mf}$ parameters are known beforehand. The treatment is bringing in bees to pollinate the flowers and $B_{mf}$s comes from a botanist who believes that particular sites and flowers would benefit a lot. The control group is denied pollination. The analysis will test the hypothesis that bees improve flower yield overall, and that this effect will be higher in meadows with higher $B_{mf}$. So it's a test of the pollination and also of the botanist. We want to bring the bees to the most promising meadows, and measure the effect there. The plan it to test less promising meadows later.

The $L$ meadows will be split into a treatment and control group and compared in terms of some outcome $Y$. In addition to the sampling problem, I am not sure if doing matched pairs makes sense here, or a more general randomized block design.

I would appreciate any advice, references, or solutions to the design and analysis. If anything is unclear, please let me know.


Addendum: I don't know if this simplifies the problem, but you can also assume that all meadows have every type of flower, but that for some meadows $B_{mf}=0$ for at least one species.

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
dimitriy
  • 31,081
  • 5
  • 63
  • 138
  • I'm not sure I understand: All the $B_{mf}$ are known constants before the experiment? What will be measured in the experiment? I think your description is too brief! – kjetil b halvorsen Aug 16 '20 at 02:35
  • @kjetilbhalvorsen I added the explanation to the end of the question. Please let me know if it did not clear things up. – dimitriy Aug 16 '20 at 02:56
  • So the problem is to choose the $m, f$ indices with the $B_{mf}$ known, which looks like a pure optimization problem, and then the randomization&experimentation follows? – kjetil b halvorsen Aug 16 '20 at 03:31
  • It certainly that optimization subject to the constraints, but also what type of randomization and analysis makes sense here (matches pairs of meadows or something else). – dimitriy Aug 16 '20 at 03:37
  • Unless there are some special structure on the meadows, I cannot see how matched pairs can help? – kjetil b halvorsen Aug 16 '20 at 04:32
  • What is $B_{mf}$ again? :) – usεr11852 Aug 18 '20 at 20:42
  • Can the problem be simplified by fixing V and L? The current constrains are fuzzy to me. – Sextus Empiricus Aug 18 '20 at 21:43
  • @user11852 $B_{mf}$ is the botanist's estimate of the effect on flower type $f$ in meadow $m$ from adding a hive of bees. The assumption is that bees help flowers grown. It is non-stochastic and taken as given. – dimitriy Aug 18 '20 at 23:18
  • @SextusEmpiricus Fixing $V$ and $L$ is specified in the problem statement. $V$ is fixed between 20-30 and $L$ is pinned down by $\frac{5000}{V}$. If that is not what you meant by "fixed", please elaborate on what your definition is and what "fuzzy" means. – dimitriy Aug 18 '20 at 23:19
  • @Dimitriy you wrote in your question $LV \approx 5000$ and not $LV = 5000$. How far from 5000 can we be? – Sextus Empiricus Aug 19 '20 at 06:30
  • Under is OK. Over is not. Let's say within 150 – dimitriy Aug 19 '20 at 06:33

1 Answers1

3

First, the problem of maximizing $\sum B_{mf}$ is a knapsack problem, there does exist at least one R package for such problems, adagio. If $B_{mf}$ is undefined for some combinations, just give it some negative value!

Then, assuming some pairs are chosen that way, the problem is the design. If the treatment effect really depends on $B_{½mf}$ (and to test that), it might be an advantage to restrict the randomization to some balancing of the $B_{mf}$. You asks about a paired design, which would entail selecting paits of meadows with close $B$'s. Maybe that will loose too many df's, so just define blocks with little variation in the $B$'s, and randomize within the blocks. Hope I have understood the problem correctly ...

More about the knapsack problem Introduce binary indicator variables $I_m$ for the chosen meadows and $J_f$ for the chosen flowers, then you want to maximize $\sum_m\sum_f I_m B_{mf} J_f$. Observe that this give a solution that satisfy your restriction

all V varieties must grow in every one of the L selected meadows

Which asks for selecting a (rectangular ) submatrix of $B$ that maximizes the sum of its elements. In matrix terms, this means maximizing the quadratic expression $$ I^T B J $$ where $I, J$ are the vectors with elements $I_m, J_f$, under the restriction $\sum I_m \leq L, \sum J_f \leq V$. When some $B_{mf}$ is undefined, assign those a sufficiently large negative value might be sufficient, but what to do in this case might depend on the pattern (and quantity) of undefinedness.

This problem is similar to the Quadratic knapsack problem but extends that too, and seems to be known as the maximum sum rectangular submatrix problem. But some formulations of that problem seems to seek for a contiguous submatrix, which is not what we want! I don't have time to study this more now, but have asked this question: https://or.stackexchange.com/questions/4730/a-variant-of-maximum-sum-subarray-problem so hopefully some of the people there with experience in combinatorial optimization can help.

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
  • [A list of Qs with answers about knapsack problems](https://stats.stackexchange.com/search?q=knapsack+answers%3A1) – kjetil b halvorsen Aug 16 '20 at 22:29
  • 1
    This is already quite helpful. I am having some trouble mapping my meadows-flowers problem to one of the generalized KPs, or else expressing it as a set of constraint. Is there a connection that is obvious to you here? – dimitriy Aug 17 '20 at 06:38
  • 1
    Aren't the restrictions like $20 \leq \sum J_f \leq30 $. The knapsack problem with $\sum J_f \leq 30$ might give you a selection with less than 20 types of flowers but higher sum because higher $B_{mf}$ – Sextus Empiricus Aug 19 '20 at 06:35
  • @Sextus Empiricus: This is true, I will edit. Also I guess the pattern of undefinedness may play games ... – kjetil b halvorsen Aug 19 '20 at 23:58