My question is why do we use exponential as the nonlinearity function here (or why choose $\log()$ as the link function)?
As AlaskaRon states, it is because the log is the canonical link for the Poisson distribution, and the canonical form has many desirable properties. In general, we can write the density, $f$, of an exponential family as
$$\log f(y;\theta,\tau) = \log h(y, \tau) + \frac{b(\theta)T(y) - A(\theta)}{d(\tau)}$$
using the same notation as Wikipedia, where $\tau$ is the dispersion parameter and $\theta$ is related to the mean. We are working with the canonical form when $b$ is the identity function:
$$\log f(y;\theta,\tau) = \log h(y, \tau) + \frac{\theta T(y) - A(\theta)}{d(\tau)}.$$
This is the case for the Poisson model with the log link. You can see this by setting $\theta = \eta = \log \lambda$, $T(y) = y$, and $\tau = 1$ with $d(\tau) = 1$, which gives
$$\begin{align*}
\log P(Y = y) &= y\log \lambda - \lambda -\log y! \\
&= \underbrace{-\log y!}_{\log h(y,\tau)} + y\eta-\underbrace{\exp\eta}_{A(\eta)}
\end{align*}$$
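If you want to check this decomposition numerically, here is a minimal sketch using only the Python standard library. The function names and the value of `lam` are my own choices for illustration; `math.lgamma(y + 1)` is used for $\log y!$.

```python
import math

def poisson_log_pmf(y, lam):
    """log P(Y = y) for a Poisson with mean lam, written directly."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def canonical_log_pmf(y, eta):
    """Same quantity in canonical form: log h(y) + eta * y - A(eta)."""
    log_h = -math.lgamma(y + 1)  # log h(y) = -log y!
    A = math.exp(eta)            # A(eta) = exp(eta)
    return log_h + eta * y - A

lam = 3.7             # arbitrary illustrative mean
eta = math.log(lam)   # natural parameter under the log link

# Largest absolute difference between the two forms over y = 0, ..., 19
max_gap = max(
    abs(poisson_log_pmf(y, lam) - canonical_log_pmf(y, eta))
    for y in range(20)
)
print(max_gap)
```

The two expressions should agree up to floating point rounding.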
One advantage of the canonical form is that the mean is
$$\text{E}(y; \theta,\tau) = A'(\theta)$$
and the variance is
$$\text{Var}(y; \theta,\tau) = A''(\theta)d(\tau)\geq0$$
Given that $d(\tau) > 0$, this implies $A''(\theta) \geq 0$ and hence $\partial^2/\partial\theta^2\, \log f(y;\theta,\tau) \leq 0$, because the only term that is nonlinear in $\theta$ enters through $-A(\theta)$. So the log-density, and therefore the log-likelihood, is concave in $\theta$, which is nice for maximum likelihood estimation. There are further nice properties in terms of the moment generating function, that $T(y)$ is also the identity for many distributions, and more properties which are useful in a variety of applications.
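For the Poisson case, $A(\theta) = \exp\theta$, so both $A'(\theta)$ and $A''(\theta)$ should equal $\lambda$, and the log-likelihood should have negative curvature in $\theta$ everywhere. A rough numerical sketch of this, using finite differences and a small made-up sample `ys` (the data, step sizes, and helper names are assumptions for illustration only):

```python
import math

def A(theta):
    """Log-partition function of the Poisson in canonical form."""
    return math.exp(theta)

def d1(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

lam = 2.5
theta = math.log(lam)
mean_from_A = d1(A, theta)  # should be close to lam
var_from_A = d2(A, theta)   # should be close to lam as well (d(tau) = 1)

ys = [0, 3, 1, 4, 2]  # made-up counts

def loglik(t):
    """Poisson log-likelihood of ys as a function of the natural parameter t."""
    return sum(y * t - math.exp(t) - math.lgamma(y + 1) for y in ys)

# Curvature of the log-likelihood at a few values of theta
curvatures = [d2(loglik, t) for t in (-2.0, -0.5, 0.0, 1.0, 2.0)]
print(mean_from_A, var_from_A, curvatures)
```

All the curvatures should come out negative, which is the concavity described above.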
See also this question, this question, and this answer.
Why not use any other positive, monotonically increasing function?
You can use another link function. You will not have all the nice properties which you get with the canonical form but the link function you choose may be a better approximation of the data generating process.
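As a small illustration that other positive, increasing functions also give a valid model, here is a sketch that fits an intercept-only Poisson model by grid search under two inverse links: the exponential (the canonical choice) and the softplus $\log(1 + e^{\eta})$. The softplus is just one alternative I picked for the example, and the data `ys` and the grid are made up.

```python
import math

ys = [0, 3, 1, 4, 2]  # made-up counts with sample mean 2.0

def exp_inv_link(eta):
    """Inverse of the log link (canonical for the Poisson)."""
    return math.exp(eta)

def softplus_inv_link(eta):
    """An alternative positive, increasing inverse link (illustrative choice)."""
    return math.log1p(math.exp(eta))

def loglik(eta, inv_link):
    """Poisson log-likelihood of ys for an intercept-only model."""
    lam = inv_link(eta)
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

# Crude grid search over eta in [-5, 5] with step 0.001
grid = [-5 + 0.001 * i for i in range(10001)]

def fit(inv_link):
    eta_hat = max(grid, key=lambda eta: loglik(eta, inv_link))
    return eta_hat, inv_link(eta_hat)

eta_log, lam_log = fit(exp_inv_link)
eta_sp, lam_sp = fit(softplus_inv_link)
print(eta_log, lam_log)
print(eta_sp, lam_sp)
```

In this intercept-only case both links should recover a fitted mean close to the sample mean, just on different $\eta$ scales. Once covariates enter the model, the two links generally give different fits, which is where the choice starts to matter.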