I have a problem where I want to predict the outcome of a sequence given another sequence, online. Let $(x_1, x_2, \dots, x_T)$ be denoted by $x_{1:T}$; then I am estimating $$ p(y_T|x_{1:T}) $$ where $y_t \in Y, x_t \in X~\forall t$. That is, given a history of observations I want to predict the outcome of another variable. In practice I predict all of $y_{1:T}$, but since I do it online, I predict $y_t$ independently of $y_{t'}$ for $t' > t$.
I do that with a discriminatively trained nonlinear state space model: a recurrent neural net trained with gradient methods, using the negative log likelihood as the loss. (This can be thought of as a hidden Markov model with Dirac distributions everywhere.)
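To make this concrete, here is a minimal sketch of the kind of model I mean, in PyTorch; the discretization of the outcome space, the sizes, and the names are just illustrative, not the actual model.

```python
# Minimal sketch (assumptions: the outcome space is discretized into n_out bins,
# inputs have n_in dimensions; all sizes and names are illustrative).
import torch
import torch.nn as nn

n_in, n_hidden, n_out = 4, 32, 50

rnn = nn.RNN(n_in, n_hidden, batch_first=True)  # nonlinear state space model
readout = nn.Linear(n_hidden, n_out)            # hidden state -> logits over y_t

def nll_loss(x, y):
    """x: (batch, T, n_in) command sequences, y: (batch, T) integer position bins."""
    h, _ = rnn(x)                               # h: (batch, T, n_hidden)
    logits = readout(h)                         # softmax of these gives p(y_t | x_{1:t})
    return nn.functional.cross_entropy(         # negative log likelihood, averaged over steps
        logits.reshape(-1, n_out), y.reshape(-1))

opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
x = torch.randn(8, 20, n_in)                    # dummy batch of command sequences
y = torch.randint(0, n_out, (8, 20))            # dummy position labels
opt.zero_grad()
nll_loss(x, y).backward()
opt.step()
```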
What I am wondering is how I can incorporate any knowledge I have about $p(y_{t+1}|y_t)$.
Example: I am modelling the position of a car $y_t$ given a history of motor commands $x_{1:t}$. I know the maximum velocity $v$ of the car, thus I could assume that $$p(y_{t+1}|y_t) \propto \begin{cases} 0 & \text{if}~y_{t+1} - y_t > v, \\ 1 & \text{otherwise.} \end{cases} $$
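Over a discretized position grid, that prior can be written down directly as a transition matrix, as in the sketch below; the bin width, a time step of 1, and reading the constraint symmetrically ($|y_{t+1} - y_t| \le v$) are my assumptions, not part of the problem statement.

```python
# Sketch of the transition prior as a matrix over position bins (assumptions:
# bin width dy, time step of 1, constraint taken symmetrically, i.e.
# |y_{t+1} - y_t| <= v).
import torch

n_out, dy, v = 50, 1.0, 3.0
pos = torch.arange(n_out, dtype=torch.float32) * dy
diff = pos[None, :] - pos[:, None]              # diff[i, j] = pos[j] - pos[i], i.e. y_{t+1} - y_t
prior = (diff.abs() <= v).float()               # 1 for feasible moves, 0 otherwise
prior = prior / prior.sum(dim=1, keepdim=True)  # row-normalize into p(y_{t+1} = j | y_t = i)
```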
My question now is: how to make use of that?
One attempt would be to penalize hypotheses whose output probabilities violate $p(y_{t+1}|y_t)$. I tried this by adding a regularizer to the log likelihood during parameter estimation, but it did not work out very well.
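For illustration, one form such a regularizer could take (the exact form is not the point here) is to charge the model for putting consecutive predictive mass on transitions the prior forbids, assuming per-step distributions `probs[t]` over the discretized positions and reusing the `prior` matrix from the sketch above.

```python
# One possible penalty term (illustrative): expected prior probability of the
# implied transition at each step, under the product of the per-step marginals.
import torch

def transition_penalty(probs, prior, eps=1e-8):
    """probs: (T, n_out) predictive distributions p(y_t | x_{1:t}),
    prior: (n_out, n_out) transition matrix p(y_{t+1} | y_t)."""
    joint = probs[:-1, :, None] * probs[1:, None, :]    # (T-1, n_out, n_out) product of marginals
    expected_prior = (joint * prior[None]).sum(dim=(1, 2))
    return -torch.log(expected_prior + eps).mean()      # large when predicted moves are implausible

# the total loss would then be something like: nll + lam * transition_penalty(probs, prior)
```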
Another guess is to just multiply that prior with my output probability:
$$ p(y_{t+1}|x_{1:t}) = p(y_{t+1}|y_t) p(y_{t+1}|x_{1:t+1}) p(y_t|x_{1:t}). $$ But this is wrong since $p(y_{t+1}|x_{1:t+1})$ and $p(y_t|x_{1:t})$ are not independent.
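Spelled out over the discretized grid (reading the free $y_t$ on the right-hand side as being marginalized out and renormalizing at the end), this guess would look roughly like the sketch below; it only makes the guess explicit and does not address the independence problem just mentioned.

```python
# The "multiply the prior in" guess over the discretized grid, with the free y_t
# marginalized out and the result renormalized (does not fix the double use of x_{1:t}).
import torch

def combine(prior, p_curr, p_next):
    """prior:  (n, n) with prior[i, j] = p(y_{t+1}=j | y_t=i)
    p_curr: (n,) = p(y_t | x_{1:t}) from the RNN
    p_next: (n,) = p(y_{t+1} | x_{1:t+1}) from the RNN"""
    combined = p_next * (p_curr @ prior)   # sum_i p(y_{t+1} | y_t=i) p(y_t=i | x_{1:t})
    return combined / combined.sum()       # renormalize to a distribution
```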