I have a problem where I want to predict the outcome of a sequence given another sequence, online. Let $(x_1, x_2, \dots, x_T)$ be denoted by $x_{1:T}$; then I am estimating $$ p(y_T|x_{1:T}) $$ where $y_t \in Y, x_t \in X~\forall t$. That is, given a history of observations I want to predict the outcome of another variable. In practice I predict all of $y_{1:T}$, but since I do it online, I predict $y_t$ independently of $y_{t'}$ for $t' > t$.
I do that with a discriminatively trained nonlinear state space model: a recurrent neural net trained with gradient methods, using the negative log likelihood as the loss. (This can be thought of as a hidden Markov model with Dirac distributions everywhere.)
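To make this concrete, here is a minimal sketch of the kind of model I mean, in PyTorch; the discretization of the outcome space, the sizes, and the names are just illustrative, not the actual model.

```python
# Minimal sketch (assumptions: the outcome space is discretized into n_out bins,
# inputs have n_in dimensions; all sizes and names are illustrative).
import torch
import torch.nn as nn

n_in, n_hidden, n_out = 4, 32, 50

rnn = nn.RNN(n_in, n_hidden, batch_first=True)  # nonlinear state space model
readout = nn.Linear(n_hidden, n_out)            # hidden state -> logits over y_t

def nll_loss(x, y):
    """x: (batch, T, n_in) command sequences, y: (batch, T) integer position bins."""
    h, _ = rnn(x)                               # h: (batch, T, n_hidden)
    logits = readout(h)                         # softmax of these gives p(y_t | x_{1:t})
    return nn.functional.cross_entropy(         # negative log likelihood, averaged over steps
        logits.reshape(-1, n_out), y.reshape(-1))

opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
x = torch.randn(8, 20, n_in)                    # dummy batch of command sequences
y = torch.randint(0, n_out, (8, 20))            # dummy position labels
opt.zero_grad()
nll_loss(x, y).backward()
opt.step()
```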
What I am wondering is how I can incorporate any knowledge I have about $p(y_{t+1}|y_t)$.
Example: I am modelling the position of a car $y_t$ given a history of motor commands $x_{1:t}$. I know the maximum velocity $v$ of the car, thus I could assume that $$p(y_{t+1}|y_t) \propto \begin{cases} 0 & \text{if}~y_{t+1} - y_t > v, \\ 1 & \text{otherwise.} \end{cases} $$
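Over a discretized position grid, that prior can be written down directly as a transition matrix, as in the sketch below; the bin width, a time step of 1, and reading the constraint symmetrically ($|y_{t+1} - y_t| \le v$) are my assumptions, not part of the problem statement.

```python
# Sketch of the transition prior as a matrix over position bins (assumptions:
# bin width dy, time step of 1, constraint taken symmetrically, i.e.
# |y_{t+1} - y_t| <= v).
import torch

n_out, dy, v = 50, 1.0, 3.0
pos = torch.arange(n_out, dtype=torch.float32) * dy
diff = pos[None, :] - pos[:, None]              # diff[i, j] = pos[j] - pos[i], i.e. y_{t+1} - y_t
prior = (diff.abs() <= v).float()               # 1 for feasible moves, 0 otherwise
prior = prior / prior.sum(dim=1, keepdim=True)  # row-normalize into p(y_{t+1} = j | y_t = i)
```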
My question now is: how to make use of that?
One attempt would be to penalize hypotheses whose output probabilities violate $p(y_{t+1}|y_t)$. I tried this by adding a regularizer to the log likelihood during parameter estimation, but it did not work out very well.
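For illustration, one form such a regularizer could take (the exact form is not the point here) is to charge the model for putting consecutive predictive mass on transitions the prior forbids, assuming per-step distributions `probs[t]` over the discretized positions and reusing the `prior` matrix from the sketch above.

```python
# One possible penalty term (illustrative): expected prior probability of the
# implied transition at each step, under the product of the per-step marginals.
import torch

def transition_penalty(probs, prior, eps=1e-8):
    """probs: (T, n_out) predictive distributions p(y_t | x_{1:t}),
    prior: (n_out, n_out) transition matrix p(y_{t+1} | y_t)."""
    joint = probs[:-1, :, None] * probs[1:, None, :]    # (T-1, n_out, n_out) product of marginals
    expected_prior = (joint * prior[None]).sum(dim=(1, 2))
    return -torch.log(expected_prior + eps).mean()      # large when predicted moves are implausible

# the total loss would then be something like: nll + lam * transition_penalty(probs, prior)
```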
Another guess is to just multiply that prior with my output probability:
$$ p(y_{t+1}|x_{1:t}) = p(y_{t+1}|y_t) p(y_{t+1}|x_{1:t+1}) p(y_t|x_{1:t}). $$ But this is wrong since $p(y_{t+1}|x_{1:t+1})$ and $p(y_t|x_{1:t})$ are not independent.
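Spelled out over the discretized grid (reading the free $y_t$ on the right-hand side as being marginalized out and renormalizing at the end), this guess would look roughly like the sketch below; it only makes the guess explicit and does not address the independence problem just mentioned.

```python
# The "multiply the prior in" guess over the discretized grid, with the free y_t
# marginalized out and the result renormalized (does not fix the double use of x_{1:t}).
import torch

def combine(prior, p_curr, p_next):
    """prior:  (n, n) with prior[i, j] = p(y_{t+1}=j | y_t=i)
    p_curr: (n,) = p(y_t | x_{1:t}) from the RNN
    p_next: (n,) = p(y_{t+1} | x_{1:t+1}) from the RNN"""
    combined = p_next * (p_curr @ prior)   # sum_i p(y_{t+1} | y_t=i) p(y_t=i | x_{1:t})
    return combined / combined.sum()       # renormalize to a distribution
```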