
I am calculating the average distance from the origin in a 3-D random walk after $N$ steps. Each step is one unit long in a randomly chosen direction. After $N$ steps, what is the average distance from the origin?

The $X,Y,Z$ coordinates are determined by $Z=\cos(a),$ $X=\sin(a)\cos(b),$ and $Y=\sin(a)\sin(b).$ The angles $a$ and $b$ are distributed uniformly on $[0, 2\pi).$

I have simulated the process using 50,000 points. After 1 step, the average distance is 1. After 2 steps, the average distance is around 1.32. After 3 steps, the average distance is around 1.62.

How can I derive a formula for the average distance from the origin after $N$ steps?
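For reference, a minimal sketch of such a simulation in Python. It samples directions uniformly on the unit sphere by normalizing Gaussian vectors (note this differs from drawing both angles uniformly, which does not give uniformly distributed directions); the function name and sample sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_distance(n_steps, n_walks=50_000):
    """Monte Carlo estimate of the mean distance from the origin
    after n_steps unit-length steps of a 3-D random walk."""
    # Uniform directions on the unit sphere: normalize standard
    # Gaussian vectors (uniform angles a, b would NOT be uniform).
    steps = rng.standard_normal((n_walks, n_steps, 3))
    steps /= np.linalg.norm(steps, axis=2, keepdims=True)
    endpoints = steps.sum(axis=1)            # position after n_steps
    return np.linalg.norm(endpoints, axis=1).mean()

print(mean_distance(1))  # exactly 1: a single unit step
print(mean_distance(2))  # roughly 1.3
print(mean_distance(3))  # roughly 1.6
```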

whuber
shaw shen
  • An exact calculation is truly messy and the length of its expression grows in proportion to $N.$ Would a good approximation be acceptable? Alternatively, the expected *squared* distance at step $N$ has a truly simple formula. – whuber Jan 04 '22 at 15:47
  • Not an answer but off the top of my head, I would attempt the problem for the 1-D case and then the 2-D case and see how those work first. There may not even be a closed form solution but I'm sure some of the probabilist wizards have attempted to solve that sort of problem. – mlofton Jan 04 '22 at 15:48
  • @mlofton The components follow uniform distributions, whence the sums of their squares follow distributions that are as complicated to express exactly as any [sum of uniforms](https://stats.stackexchange.com/a/43075/919). Although those expectations are readily computed, the same does not go for the square root. Normal approximations, the delta method, or simulations are needed. – whuber Jan 04 '22 at 15:50
  • @whuber Actually, I am looking for the equation when $n$ is close to infinity. For a 2-D random walk I figured out that the average distance after $N$ steps is $\sqrt{N\pi/4}$. – shaw shen Jan 05 '22 at 12:00
  • @whuber: That's interesting but above my pay grade. Shaw Shen, I would take whuber's advice because he knows this material infinitely more deeply than I do. I was thinking that the 1-D case would be more straightforward and possibly have a closed form solution but that doesn't seem to be the case. – mlofton Jan 05 '22 at 16:54
  • Given the setup in your second paragraph, $(X,Y,Z)$ would not be uniform on the unit-sphere. – Jarle Tufto Jan 05 '22 at 20:43

1 Answer


Let's solve this in all dimensions $d=1,2,3,\ldots.$

The (vector) increments of the walk are $\mathbf{X}_i = (x_{1i}, x_{2i}, \ldots, x_{di}).$ After $n$ such independent steps the walk has reached the point $\mathbf{S}_n = \mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_n$ with corresponding components $s_{1n}, \ldots, s_{dn}.$ The question asks for the expectation of $|\mathbf{S}_n| = \sqrt{s_{1n}^2 + \cdots + s_{dn}^2}$ for large $n.$

Because the $\mathbf{X}_i$ are uniformly distributed on the unit sphere,

  1. Their components are identically distributed. (Thus, in particular, they have identical means, variances, and covariances. Details are given at https://stats.stackexchange.com/a/85977/919, but this additional information is not necessary for the following analysis.)

  2. Their means are all zero (since the spherical symmetry implies the means equal their own negatives and the boundedness of the vectors implies the means exist and are finite.)

  3. The variances of each $x_{ki}$ are all $1/d,$ because for any fixed $i,$ the sum of the variances of the $x_{ki}$ equals the expectation of the sum of their squares, and that sum is identically $x_{1i}^2 + \cdots + x_{di}^2 = 1.$

  4. Their covariances are all zero. This is a bit of a surprise, because the restriction $x_{1i}^2 + \cdots + x_{di}^2 = 1$ implies the components of any $\mathbf{X}_i$ are not independent. Nevertheless, the spherical symmetry of the distribution of $\mathbf{X}_i$ implies the distribution of $y_i=(x_{1i} + x_{2i} + \cdots + x_{di})/\sqrt{d}$ is identical to that of any of the components, whence
    $$\frac{1}{d} = \operatorname{Var}\left(y_i\right) = \frac{1}{d}\sum_{j,k=1}^d E\left[x_{ji}x_{ki}\right] = \frac{1}{d}\left(d\operatorname{Var}(x_{1i}) + d(d-1)\operatorname{Cov}(x_{1i}, x_{2i})\right).$$ Upon plugging in $1/d$ for the variance term on the right, we see the last term $d(d-1)\operatorname{Cov}(x_{1i},x_{2i})$ must be zero. Since either there is no covariance (for $d=1$) or else $d\gt 1,$ the covariance is zero, QED.
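Points 2–4 are easy to check numerically. A sketch for $d=3$ (sampling uniform directions by normalizing Gaussian vectors, an implementation choice not specified in the question):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
# One million uniform directions on the unit sphere in R^d.
x = rng.standard_normal((1_000_000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

print(x.mean(axis=0))           # all ~ 0: zero means
print(x.var(axis=0))            # all ~ 1/3: variances are 1/d
print(np.cov(x, rowvar=False))  # off-diagonal ~ 0: zero covariances
```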

Because the increments are independent, the multivariate Central Limit Theorem (CLT) tells us the distribution of $\mathbf{S}_n$ is approximately multivariate Normal. The approximating Normal distribution's parameters are determined by the means and variances of the $\mathbf{X}_i:$ it will have zero mean, variances of $n/d,$ and zero covariances. Ergo,

the variables $(d/n)s_{kn}^2$ must be distributed approximately like squares of standard Normal variates and (therefore) their sum $(d/n)|\mathbf{S}_n|^2$ must be distributed approximately like the sum of squares of $d$ uncorrelated (whence independent) standard Normal variates.

By definition, a sum of squares of $d$ independent standard Normal variables has a chi-squared distribution with $d$ degrees of freedom. Also by definition, its square root has a chi distribution with $d$ d.f. Its expectation is

$$ E\left[|\mathbf{S}_n|\right] = \sqrt{\frac{n}{d}}E\left[\sqrt{(d/n)s_{1n}^2 + (d/n)s_{2n}^2 + \cdots + (d/n)s_{dn}^2}\right] = \frac{\sqrt{2n}\,\Gamma((d+1)/2)}{\sqrt{d}\,\Gamma(d/2)}.$$

As a special case, when $d=2$ the right hand side is

$$\frac{\sqrt{2n}\,\Gamma((2+1)/2)}{\sqrt{2}\,\Gamma(2/2)} = \frac{\sqrt{n\pi}}{2},$$

exactly as noted in a comment to the question. When $d=3$ (the case of the question), the right hand side is

$$\frac{\sqrt{2n}\,\Gamma((3+1)/2)}{\sqrt{3}\,\Gamma(3/2)} = \frac{2\sqrt{2n}}{\sqrt{3\pi}}.$$
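These formulas are easy to evaluate in code; a small helper (the log-gamma form is merely a numerical-stability choice, not part of the derivation):

```python
from math import exp, lgamma, pi, sqrt

def expected_distance(n, d):
    """Asymptotic mean distance sqrt(2n/d) * Gamma((d+1)/2) / Gamma(d/2)."""
    return sqrt(2 * n / d) * exp(lgamma((d + 1) / 2) - lgamma(d / 2))

# Agreement with the special cases worked out above for d = 2 and d = 3:
n = 5
assert abs(expected_distance(n, 2) - sqrt(n * pi) / 2) < 1e-12
assert abs(expected_distance(n, 3) - 2 * sqrt(2 * n) / sqrt(3 * pi)) < 1e-12
```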

To illustrate the general formula, here is a plot of $\sqrt{2n}\,\Gamma((d+1)/2) / (\sqrt{d}\,\Gamma(d/2))$ for $d=3$ (in red) along with the means of 1,000,000 simulated random walks at times $1$ through $n=30.$ They look to be in good agreement, especially for $n\gt 1.$ The differences between the means and this formula approach zero at a rate of $O(n^{-1/2})$ (plotted in blue) or better, as predicted by the CLT.

[Figure: means of 1,000,000 simulated walks at times $1$ through $30$ against the formula for $d=3$ (red curve), with the $O(n^{-1/2})$ error rate in blue.]
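The simulation behind the figure can be sketched as follows (reduced to 100,000 walks here; sampling by normalized Gaussians is an implementation choice):

```python
import numpy as np
from math import exp, lgamma, sqrt

rng = np.random.default_rng(2)

def chi_mean(n, d=3):
    # Asymptotic mean distance from the chi distribution with d d.f.
    return sqrt(2 * n / d) * exp(lgamma((d + 1) / 2) - lgamma(d / 2))

n_walks, n_max, d = 100_000, 30, 3
steps = rng.standard_normal((n_walks, n_max, d))
steps /= np.linalg.norm(steps, axis=2, keepdims=True)   # unit-length steps
paths = steps.cumsum(axis=1)                            # S_1, ..., S_30
sim_means = np.linalg.norm(paths, axis=2).mean(axis=0)  # estimated E|S_n|

for n in (1, 5, 30):
    print(n, round(sim_means[n - 1], 3), round(chi_mean(n), 3))
```

As in the figure, the simulated means and the formula agree closely except at the smallest $n$, where the discrepancy is on the order of $n^{-1/2}$.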

whuber
  • Is the distance measure appropriate for d>3? I know that for 3d we use it habitually, but in d>>3 it doesn’t work very well in my opinion – Aksakal Jan 05 '22 at 19:10
  • +1 very nice answer. Here are some thoughts: let $e_d(n)$ be the expectation distance after $n$ steps on dimension $d$. Your answer suggests that, for $n$ sufficiently large, we should have the recursive relation $e_{d+1}(n) = \sqrt{\frac{d}{d+1}}\frac{d}{2} e_d(n)$. So the expected distance when we "go up" one dimension does not depend on $n$, but increases the by a factor of $d/2$. However, every step still has length $1$ for any dimension. What is causing this increase? – Lucas Prates Jan 05 '22 at 19:12
  • Moreover, for large $n$, could we just simulate for $d=1$ and translate results for large dimensions? I prefer to sample from a $\{-1,1\}$ coin than to simulate from those high-dimension spheres. – Lucas Prates Jan 05 '22 at 19:14
  • I think, unless we are specially interested in 3d, the distance measure should be scaled by $\sqrt d$ – Aksakal Jan 05 '22 at 19:14
  • @LucasPrates when you say every step has length 1, be careful. In two dimensions the step length one inch is on a circle that covers $\pi$ square inches of area, but in 3d the same step is on a sphere that includes $4/3\pi$ cubic inches of volume. So the step of the same length represents different things in different dimensional spaces – Aksakal Jan 05 '22 at 19:26
  • @Lucas You lost me at the outset: where did that recursion come from? It cannot be correct, because it would imply $e_d$ should be approximately proportional to $d!,$ which is way too large. – whuber Jan 05 '22 at 20:07
  • @Aksakal Nevertheless, lengths are directly comparable in every dimension. We rely on that all the time when we use rulers to measure distances on paper ($d=2$) or in space ($d=3$). Areas and volumes are not directly relevant in this problem. As to your initial comment, it's unclear what you might mean by Euclidean distance not "working" for $d\gg 3.$ There's certainly no particular problem for large $d$ in the present analysis. – whuber Jan 05 '22 at 20:09
  • @whuber I don't think lengths are comparable in different dimensions. I ran into an issue when applying Mahalanobis distance for forecast performance comparison, but Euclidian distance has the same problem. Imagine, you're forecasting a vector of variables, but the vector dimension changes. For instance, in some periods you forecast GDP and in others you don't. So, if you use Euclidian distance (root mean squared error), then you'll end up effectively penalizing the attempt to forecast more variables. This type of problems are more common than people think – Aksakal Jan 05 '22 at 20:15
  • @Aksakal I understand that steps of the same size "covers" different volumes, but I do not see why the volume would play a role when we only consider the length. – Lucas Prates Jan 05 '22 at 20:16
  • @whuber, so the solution that I used was to scale by $\sqrt d$, it may not be the best approach, but it alleviate the major bias in Euclidian distance in this application – Aksakal Jan 05 '22 at 20:16
  • @LucasPrates, see my comment reply to whuber, the volume represents the space you're dealing with. Ultimately, you're operating with the joint distribution function of rv, and the volume of the domain of the function matters. this is because the volume represents the number of states the system can be. in higher dimensions you live in a very large state space, so the distance should account for increased volume somehow, at least in some applications – Aksakal Jan 05 '22 at 20:19
  • @Aksakal Your comments are very far afield from the question we are concerned about in this thread! – whuber Jan 05 '22 at 20:20
  • @whuber, perhaps, but OP did not specify the distance measure, while you assumed it's Euclidian. what if OP meant Manhattan distance? – Aksakal Jan 05 '22 at 20:22
  • @Aksakal In a context where the OP is using spherical coordinates to generate data that are "one unit long," isn't it completely clear that their interest is in Euclidean distance? – whuber Jan 05 '22 at 20:25
  • @whuber There was miscalculation indeed, it is $e_{d+1}(n) = \sqrt{\frac{d}{d+1}}d/2\frac{\Gamma(d/2)^2}{\Gamma((d+1)/2)^2}e_{d}(n)$. So it seems that going up dimensions actually should have no effect on the expected length (the term before $e_d(n)$ seems to converge to $1$) ? – Lucas Prates Jan 05 '22 at 20:35
  • @Aksakal I agree that "going up" dimensions can have unexpected effects, for example recurrence/transience of random walks. But there, I could understand how the dimension plays a role. In this case, I did not. But my calculation was incorrect, so it seems there is no counter-intuitive phenomena happening here. – Lucas Prates Jan 05 '22 at 20:44
  • @Lucas That's an interesting observation. I think it has an intuitive explanation: in high dimensions ($d\gg 1$), each new increment will be nearly orthogonal to the previous one. Thus, each new step adds very nearly $1$ to the sum of squares of the components, whence $|\mathbf{S}_n|^2$ will be close to (but just a tiny bit less than) $n.$ – whuber Jan 05 '22 at 20:51
  • @whuber It is an interesting interpretation. Since the term in the relation is "free" of $n$, would that interpretation apply if $n >> d$? In that case, you would have so many increments that they might not be nearly orthogonal. – Lucas Prates Jan 05 '22 at 21:23
  • @Lucas Yes, although with large $n$ the increments would not be perfectly orthogonal, the *relative* difference between $|\mathbf{S}_n|^2$ and $n$ will still be small, regardless of $d.$ – whuber Jan 06 '22 at 02:16