1) It's unclear why we need to shake both boxes in Step 2. As I read the description, coins move from A to B according to their probabilities of showing Heads, while coins move from B to A according to the experimental result of Step 2. If so, it is sufficient to shake box B only. Is this correct?
2) Nevertheless, assuming I understand the setting correctly, the answer depends on the distribution of "unfairness" across your N coins. Can we assume the coins can be unfair to any extent, i.e. that the probability of showing Heads can be any number in $(0,1)$? If yes, we could model this probability itself as a uniformly distributed variable on $(0,1)$ and proceed with an order-statistics analysis. If no, there should be some additional condition on how unfair a coin can be, say $0.4 \le p \le 0.6$ or similar.
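For the uniform case, here is a minimal sketch of the order-statistics fact such an analysis would build on (the sample size and replication count below are my arbitrary choices, not part of your question): the $k$-th smallest of $N$ Heads probabilities drawn uniformly on $(0,1)$ has expected value $k/(N+1)$.
# Sketch only: assumes p_j ~ Uniform(0,1); N and n.rep are arbitrary choices
N = 100
n.rep = 5000
ord = t(replicate(n.rep, sort(runif(N)))) # each row is a sorted sample p_(1) <= ... <= p_(N)
emp.mean = colMeans(ord)                  # empirical mean of the k-th order statistic
theo.mean = (1:N)/(N+1)                   # theoretical mean E[p_(k)] = k/(N+1)
plot(theo.mean, emp.mean); abline(0, 1)   # points fall on the identity line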
UPD
Based on your comment, the iterative process is "skewed": at each experimental iteration we move at most M Heads-showing coins from A to B and all Tails-showing coins from B to A. Depending on how the known probabilities $0 < p_{(1)} \le p_{(2)} \le \dots \le p_{(N)} < 1$ were initially obtained, this iterative process may or may not converge.
Here is a little simulation for uniformly drawn $p_j$ (which are assumed to be known in your setting, of course):
N = 100
M = 20
p = sort(runif(N))  # uniformly distributed unfairness in coins, known and sorted
p.A = p[1:(N-M)]    # box A starts with all but the M coins with the largest Heads probability
p.B = p[(N-M+1):N]  # box B starts with the M coins with the largest Heads probability
BernoulliExperiment = function(prob){runif(1) < prob} # TRUE = Heads with probability prob
n.iter = 10000
n.A = integer(n.iter) # number of coins in box A after each iteration
n.B = integer(n.iter) # number of coins in box B after each iteration
for(iter in 1:n.iter){
  A2B = sapply(p.A, BernoulliExperiment)     # coins in A showing Heads are candidates to move to B
  if(sum(A2B) > M)                           # move not more than M coins from A to B
    A2B[which(A2B)[1:(sum(A2B)-M)]] = FALSE
  B2A = !sapply(p.B, BernoulliExperiment)    # all coins in B showing Tails move back to A
  temp = p.A[A2B]
  p.A = sort(c(p.A[!A2B], p.B[B2A]))
  p.B = sort(c(p.B[!B2A], temp))
  rm(temp)
  n.A[iter] = length(p.A)
  n.B[iter] = length(p.B)
}
plot(n.A, type='l') # trace of the number of coins in box A
plot(n.B, type='l') # trace of the number of coins in box B
The process is balanced on average, but the number of coins in box B does not converge to any particular count. Thus, no number of repetitions brings us closer to the desired count.
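As a rough numerical check of this claim (using only the objects produced by the simulation above), one can look at the spread of the box-B counts rather than the trace plots:
# Rough check: the box-B count keeps a persistent spread rather than settling at one value
summary(n.B)          # five-number summary of the box-B counts
sd(n.B)               # standard deviation of the box-B count over iterations
hist(n.B, breaks=30)  # distribution of box-B counts across iterations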
On the other hand, a bimodal distribution of $p_j$ (say, all $p_j$ equal either to 0.1 or 0.9 as an extreme discrete case) could lead to a convergent process of separation into two subsets of coins. This idea comes from Order Statistics by David & Nagaraya, Chapter 8 (https://onlinelibrary.wiley.com/doi/book/10.1002/0471722162)
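As a quick check of this intuition, the simulation can be rerun with a two-point distribution of unfairness. The sketch below repeats the same update rule in vectorized form; the 0.1/0.9 values and all other settings are my arbitrary choices, not part of your setting.
# Sketch: same iterative process, but with an extreme bimodal unfairness (p_j is 0.1 or 0.9)
N = 100
M = 20
n.iter = 10000
p = sort(sample(c(0.1, 0.9), N, replace=TRUE)) # two-point distribution of Heads probabilities
p.A = p[1:(N-M)]
p.B = p[(N-M+1):N]
n.B = integer(n.iter)
for(iter in 1:n.iter){
  A2B = runif(length(p.A)) < p.A            # Heads in box A (one Bernoulli draw per coin)
  if(sum(A2B) > M)                          # move not more than M coins from A to B
    A2B[which(A2B)[1:(sum(A2B)-M)]] = FALSE
  B2A = runif(length(p.B)) >= p.B           # Tails in box B; these coins return to A
  temp = p.A[A2B]
  p.A = sort(c(p.A[!A2B], p.B[B2A]))
  p.B = sort(c(p.B[!B2A], temp))
  n.B[iter] = length(p.B)
}
plot(n.B, type='l') # trace of the box-B count under the bimodal unfairness
mean(p.B > 0.5)     # fraction of 0.9-coins in box B after the last iteration:
                    # most of box B now consists of 0.9-coins, i.e. the coins have largely separated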