You can have an on-policy RL algorithm that is value-based. An example of such an algorithm is SARSA, so not all value-based algorithms are off-policy. A value-based algorithm is just an algorithm that derives its policy by first estimating the associated value function.
To understand the difference between on-policy and off-policy, you need to understand that there are two phases of an RL algorithm: the learning (or training) phase and the inference (or behaviour) phase (after the training phase). The distinction between on-policy and off-policy algorithms only concerns the training phase.
During the learning phase, the RL agent needs to learn an estimate of the optimal value (or policy) function. Given that the agent does not yet know the optimal policy, it often behaves sub-optimally. During training, it therefore faces the exploration-exploitation dilemma. In the context of RL, exploration is the selection and execution (in the environment) of an action that is likely not optimal (according to the knowledge of the agent), while exploitation is the selection and execution of an action that is optimal according to the agent's current best estimate of the optimal policy. During the training phase, the agent needs to do both: exploration is required to discover more about the optimal strategy, but exploitation is also required to refine its knowledge of the already visited and partially known states of the environment. The agent thus cannot just exploit the already visited states; it also needs to explore possibly unvisited states, which often requires performing a sub-optimal action.
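As a minimal sketch of this trade-off (assuming a hypothetical tabular action-value estimate `Q` of shape `n_states x n_actions`), $\epsilon$-greedy action selection could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon):
    """Trade off exploration and exploitation given a tabular
    action-value estimate Q (an n_states x n_actions array)."""
    if rng.random() < epsilon:
        # Exploration: a random (likely sub-optimal) action.
        return int(rng.integers(Q.shape[1]))
    # Exploitation: the greedy action according to the current estimate.
    return int(np.argmax(Q[state]))
```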
An off-policy algorithm is an algorithm that, during training, uses a behaviour policy (that is, the policy it uses to select actions) that is different from the target policy it tries to estimate (here, the optimal policy). For example, $Q$-learning often uses an $\epsilon$-greedy policy (with probability $\epsilon$ it chooses a random or explorative action and with probability $1-\epsilon$ it chooses the action that is optimal according to its current best estimate of the optimal policy) to behave (that is, to explore and exploit the environment), while, in its update rule, because of the $\max$ operator, it assumes that the greedy action (that is, the current optimal action in a given state) is chosen.
An on-policy algorithm is an algorithm that, during training, chooses actions using a policy derived from the current estimate of the optimal policy, and whose updates are based on the actions actually selected by that same (behaviour) policy; in other words, the behaviour policy and the target policy coincide. For example, SARSA is an on-policy algorithm because its update rule uses the value of the next action actually selected by the behaviour policy, rather than the $\max$ over actions.
The difference between the update rules of $Q$-learning (off-policy) and SARSA (on-policy) is thus that $Q$-learning uses the $\max$ operator (i.e. the value of the greedy action), while SARSA uses the value of the action actually taken by the behaviour policy.
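For reference, the standard tabular update rules (with step size $\alpha$ and discount factor $\gamma$) make this explicit:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad \text{($Q$-learning)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}$$

where $a_{t+1}$ is the action actually selected by the behaviour policy (e.g. $\epsilon$-greedy) in state $s_{t+1}$.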
In the case of policy-based or policy search algorithms (e.g. REINFORCE), the distinction between on-policy and off-policy is often not made because, in this context, there isn't usually a clear separation between a behaviour policy (the policy used to behave during training) and a target policy (the policy to be estimated).
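For instance, one common form of the basic REINFORCE update is

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \ln \pi_\theta(a_t \mid s_t),$$

where the return $G_t$ is computed from a trajectory sampled by following $\pi_\theta$ itself, so the same parameterised policy both generates the behaviour and is the object being updated.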
You can think of actor-critic algorithms as both value-based and policy-based, because they use both a value function (the critic) and a policy function (the actor).
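A minimal sketch of this idea, assuming a tabular one-step actor-critic with hypothetical numbers of states and actions (the environment interaction loop is omitted):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.99

H = np.zeros((n_states, n_actions))  # actor parameters (action preferences)
V = np.zeros(n_states)               # critic's estimate of the state-value function

def policy(s):
    """Softmax policy derived from the actor's preferences for state s."""
    prefs = H[s] - H[s].max()         # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_step(s, a, r, s_next, done):
    """One-step actor-critic update for the transition (s, a, r, s_next)."""
    # Critic: TD error with respect to the current value estimates.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error
    # Actor: policy-gradient-style update, using the TD error as a
    # (biased) estimate of the advantage of action a in state s.
    probs = policy(s)
    grad_log = -probs
    grad_log[a] += 1.0                # gradient of log-softmax w.r.t. H[s]
    H[s] += alpha_actor * td_error * grad_log
```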
The usual examples of model-based algorithms are value and policy iteration, which are algorithms that use the transition and reward functions (of the given Markov decision process) to estimate the value function. However, on-policy, off-policy, value-based or policy-based algorithms can also be model-based, in the sense that they might use a model of the environment in some way.
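For example, the value iteration update explicitly uses the model, i.e. the dynamics $p(s', r \mid s, a)$ of the Markov decision process:

$$V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right], \quad \text{for all states } s.$$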