> Wouldn't it be more efficient to construct an MDP from experience, by estimating the state transition probabilities and reward distribution, and then solve it with dynamic programming?
It depends what you mean by "efficient". The Dyna-Q code given is already closely related to Value Iteration in any case. If in step (f) you replaced the "random previously observed state" sampling with repeated full sweeps over all states and actions, then you could also set $\alpha = 1$, check for the maximum value change falling below some threshold, and the algorithm would be exactly Value Iteration (for a deterministic environment, as assumed in the pseudocode).
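To make that equivalence concrete, here is a minimal sketch (not taken from the original code) of the planning step rewritten as full sweeps with $\alpha = 1$, assuming the learned model is stored as a dictionary `model[(s, a)] = (r, s_next)` as in the tabular Dyna-Q pseudocode:

```python
from collections import defaultdict

def planning_as_value_iteration(model, gamma=0.95, theta=1e-6):
    """Dyna-Q planning step (f) replaced by repeated full sweeps with alpha = 1.
    For a deterministic model {(s, a): (r, s_next)} this is exactly Value Iteration
    over the recorded state-action pairs (function name and structure are illustrative)."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], default 0.0
    while True:
        max_change = 0.0
        for (s, a), (r, s_next) in model.items():
            # With alpha = 1 the Q-learning update collapses to the Bellman optimality backup
            target = r + gamma * max(Q[s_next].values(), default=0.0)
            max_change = max(max_change, abs(target - Q[s][a]))
            Q[s][a] = target
        if max_change < theta:   # threshold check in place of a fixed number of planning steps
            break
    return Q
```

With the convergence threshold replacing a fixed number of planning iterations, this is the Q-value form of Value Iteration restricted to the transitions the agent has actually observed.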
However, a basic dynamic programming approach requires you to collect an accurate model of all state transitions first. If you have a non-deterministic environment* and/or a large state space, this may not be feasible in practice. The algorithm would not start learning until after you had collected the data, and you would have no guidance for how to explore the state and action space efficiently (because your learning algorithm has nothing yet to base a policy on).
Dyna-Q allows the agent to start learning and improving incrementally much sooner. It does so at the expense of working with rougher sample estimates of the transitions, and therefore needs a recency-weighted value update ($\alpha < 1$). In practice the early, rough convergence of Q values helps focus exploration and learning effort, and across a wide range of problems the algorithm will be more efficient than basic Dynamic Programming, both in the number of samples required and in the amount of computation.
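For contrast, here is a compressed sketch of the standard tabular Dyna-Q loop (following the usual Sutton & Barto pseudocode). The `env` interface (`reset()`, `step(a)` returning `(next_state, reward, done)`, `actions(s)`) is a hypothetical stand-in, and `Q`/`model` are assumed to be a nested `defaultdict` and a plain `dict`; the point is the interleaving of direct updates, model learning, and planning with $\alpha < 1$:

```python
import random
from collections import defaultdict

def dyna_q_episode(env, Q, model, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=20):
    """One episode of tabular Dyna-Q.

    Assumptions (hypothetical interface, not from the original code):
      - env.reset() -> state, env.step(a) -> (next_state, reward, done),
        env.actions(s) -> list of legal actions in s
      - Q is defaultdict(lambda: defaultdict(float)), model is a plain dict
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection from the current (still rough) Q estimates
        if random.random() < epsilon or not Q[s]:
            a = random.choice(env.actions(s))
        else:
            a = max(Q[s], key=Q[s].get)
        s_next, r, done = env.step(a)

        # (d) direct RL update from real experience, recency-weighted via alpha < 1
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values(), default=0.0) - Q[s][a])

        # (e) model learning: remember the most recent outcome observed for (s, a)
        model[(s, a)] = (r, s_next)

        # (f) planning: replay n random previously observed (state, action) pairs
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next].values(), default=0.0) - Q[ps][pa])

        s = s_next
    return Q, model
```

The recency-weighted update is what lets the agent keep acting and improving while its sample model is still incomplete and changing.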
There will still be environments where DP methods come out ahead, though. If you are concerned about this, run some experiments to determine which approach works better for your problem.
This weakness of DP methods may also be an advantage: if you are required to learn an optimal policy for all states of an environment, then DP may well be the more efficient way to do it. Dyna-Q will tend to produce accurate values and an optimal policy only for the states reachable from the start state whilst following a near-optimal policy; it may never learn about edge cases. However, this may be a necessary compromise for large state spaces, and if you control the start states the agents see when they need to act optimally, then there is no need to learn policies for unreachable states.
I suggest comparing the two approaches on a simple problem, such as a gridworld, to get a feel for the differences.
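If it helps, here is a throwaway deterministic gridworld (entirely illustrative; the layout and class name are made up) that implements the small `reset`/`step`/`actions` interface assumed in the sketch above, so either approach can be run against it:

```python
class TinyGridworld:
    """4x4 deterministic gridworld: start at (0, 0), +1 reward for reaching (3, 3).
    Purely illustrative; made up for this comparison."""

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def actions(self, state):
        return list(self.MOVES)          # all four moves are always legal (walls just block)

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        done = self.state == self.goal
        return self.state, (1.0 if done else 0.0), done
```

Running `dyna_q_episode` repeatedly on this environment, versus exhaustively recording every transition into `model` and sweeping it with `planning_as_value_iteration`, lets you count real environment steps and Q updates for each approach directly.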
* If you have a non-deterministic environment then an accurate transition model needs good estimates of the state transition probabilities. This requires sampling each transition many times, which can multiply the amount of experience you need before the model is reliable enough for DP, compared to a deterministic environment. If you have a deterministic environment with 1000 states and 10 actions in each state, then in theory 10,000 carefully chosen experiences are enough to provide an accurate model for DP. However, if state transitions are stochastic, then you may need 1,000 times that much experience, or even more, before your model of the probabilities is accurate enough to use with DP.
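The gap in those numbers comes from how slowly empirical frequencies converge. A quick illustration (the distribution is made up and not tied to any particular environment): estimate one next-state distribution from different numbers of samples and look at the worst-case error.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.7, 0.2, 0.1])   # hypothetical next-state distribution for one (s, a) pair

for n in (1, 10, 100, 1000, 10000):
    counts = rng.multinomial(n, true_p)     # n observed transitions from this (s, a) pair
    estimate = counts / n
    print(f"{n:>6} samples  max abs error = {np.abs(estimate - true_p).max():.3f}")
```

In a deterministic environment a single visit per (state, action) pair pins the model down exactly; with stochastic transitions the estimation error only shrinks roughly as $1/\sqrt{n}$, which is consistent with needing hundreds or thousands of samples per pair before the probabilities are usable.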
All model-based approaches are sensitive to errors in the model, and the more emphasis you put on forward planning with the model, the more sensitive they are to those errors. Dynamic Programming approaches can be viewed as highly sensitive to such errors because they are designed to iterate until convergence, so the effects of errors can be magnified.