I know the details of AlphaZero, and in particular that it improves through a "policy iteration" mechanism. I found an answer proving that tabular policy iteration eventually converges to the optimal policy. But does that convergence guarantee still hold when policy iteration is combined with a neural network function approximator, as in AlphaZero?
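To be concrete about what I mean by "policy iteration converges": here is a minimal sketch of classical tabular policy iteration on an invented toy 2-state MDP (the transition tables below are made up for illustration; AlphaZero replaces these exact tables with a neural network and MCTS-based improvement).

```python
import numpy as np

# Hypothetical toy MDP: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
n_states, n_actions = 2, 2

def policy_evaluation(policy, tol=1e-8):
    """Iteratively solve for V^pi of a fixed deterministic policy."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * n_states
    while True:
        V = policy_evaluation(policy)
        stable = True
        for s in range(n_states):
            best = max(range(n_actions),
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration()
print(policy)  # converges to the greedy-optimal action in every state
```

In the tabular setting, each improvement step is exactly greedy with respect to the exact value of the current policy, which is what the convergence proof relies on; my question is whether that argument survives when both evaluation and improvement are only approximated by a network.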
Besides, we know AlphaZero starts from scratch, knowing essentially nothing, and improves on data generated by its own self-play. But what if we let AlphaZero play against chessMaster first? Would that setting make it improve faster?