
I am considering the following hypothetical situation: I have a time series of data. In general, 'the public' should have access to features of this data. However, making the time series available would constitute a privacy leak. I am considering making a moving average available instead.

Can anyone recommend either some literature on this, or some alternative methods?

I understand that this is a case-by-case question. However, I think there should be a general answer available along the following lines:

1) Privacy leaks occur because you can match up the time stamp to an individual, by using outside information.

2) Therefore, you want to make it so that each window aggregates the data of several individuals. (The data is of a form where the mean is a meaningful quantity.)
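To make (2) concrete, something along the following lines is what I have in mind; the column names and the minimum count `k` are just placeholders:

```python
import pandas as pd

# Sketch: publish a per-window mean only when the window aggregates at least
# k distinct individuals (column names and k are placeholders).
def windowed_means(df, window="7D", k=5):
    """df has columns 'timestamp' (datetime), 'individual_id', 'value'."""
    grouped = df.groupby(pd.Grouper(key="timestamp", freq=window))
    out = grouped.agg(mean_value=("value", "mean"),
                      n_individuals=("individual_id", "nunique"))
    # Suppress windows that cover too few people.
    out.loc[out["n_individuals"] < k, "mean_value"] = float("nan")
    return out[["mean_value"]]
```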

There are probably adversarial ways to break this privacy, if one is sufficiently determined. I think in this case no one is. So I'm looking for literature that deals with some real world case studies, if possible.

(This situation is hypothetical. I do not have access to the data. I am 'the public' that wants the data, and I want to suggest a reasonable approach for aggregation.)

In general, the moving average is not invertible. However, it's plausible that there are situations in which the data can be leaked in a clever way.

Cross posted here: https://datascience.stackexchange.com/questions/26851/privacy-through-moving-averages

kjetil b halvorsen
Elle Najt

2 Answers


There are at least a few ways you could calculate the moving average, but basically, for the $i$-th point, the moving average with width $2h+1$ is defined in terms of the moving sum

$$ z_i = x_{i - h} + \dots + x_{i - 1} + x_i + x_{i +1} + \dots + x_{i + h} $$

and the next point is $z_{i+1} = z_i - x_{i-h} + x_{i+h+1}$; you then convert it to an average by dividing by the width. What is problematic are the border cases near $x_1$ and $x_n$, and there are several approaches to dealing with them: for example, take $z_1 = x_1$, $z_2 = x_1 + x_2$, $\dots$, or take $z_1 = x_1 + x_2 + \dots + x_h$, $z_2 = z_1 + x_{h+1}$, $\dots$, etc. Notice that in both of the above cases you can easily de-anonymize the values by taking differences between consecutive $z_{i+1}$ and $z_i$ values; once you have de-anonymized the border cases, you can proceed to recover the rest of the values. So if you publish the full series, it does not provide any anonymization. It is an example of security by obscurity, a bad practice.
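To make the differencing attack concrete, here is a minimal sketch, assuming a trailing moving sum of width $w$ with the growing-sum border handling described above:

```python
import numpy as np

# Published series: trailing moving sums of width w, with growing sums at the left
# border (z_1 = x_1, z_2 = x_1 + x_2, ..., as described above).
rng = np.random.default_rng(0)
x = rng.integers(50, 150, size=12)      # the private series (unknown to the attacker)
w = 5
z = np.array([x[max(0, i - w + 1): i + 1].sum() for i in range(len(x))])

# Attack: recover x exactly from consecutive differences of z.
x_rec = np.empty_like(x)
x_rec[0] = z[0]                                  # border: z_1 = x_1
for i in range(1, len(z)):
    d = z[i] - z[i - 1]
    x_rec[i] = d if i < w else d + x_rec[i - w]  # z_i - z_{i-1} = x_i - x_{i-w}

assert np.array_equal(x_rec, x)                  # full de-anonymization
```

The same idea works for a centred window; the border handling only changes which differences you take first.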

There are probably adversarial ways to break this privacy, if one is sufficiently determined. I think in this case no one is.

If you are saying that your data does not need any protection, then why would you bother with pseudo-anonymization? If it needs it, then it needs something that works.

You would probably like to read more about differential privacy. TL;DR: why not simply add random noise to your data? The more noise you add, the more secure it is; the less noise, the more precise it is.
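As a rough sketch of that idea via the standard Laplace mechanism (the `epsilon` and the bound on each value's range are assumptions you would have to justify for your actual data, and I assume each individual contributes a single bounded value):

```python
import numpy as np

def laplace_noisy_moving_average(x, width, epsilon, value_range):
    """Release a moving average with Laplace noise calibrated to one value's influence."""
    # One value can shift each window's average by at most value_range / width, and it
    # appears in at most `width` windows, so a crude L1 sensitivity bound is value_range.
    sensitivity = value_range
    avg = np.convolve(x, np.ones(width) / width, mode="valid")
    noise = np.random.laplace(scale=sensitivity / epsilon, size=avg.shape)
    return avg + noise

x = np.random.uniform(0, 100, size=365)   # e.g. daily values known to lie in [0, 100]
release = laplace_noisy_moving_average(x, width=7, epsilon=1.0, value_range=100)
```

Larger `epsilon` means less noise and weaker privacy; smaller `epsilon` means the opposite.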

Tim

Differential privacy is well-suited to this use case. This paper proposes a relatively simple solution to what you want to do (releasing moving averages) without adding too much noise. I believe more efficient techniques have been found since, but you should start with this and see how it goes, and then look at the citations of this paper for improvements.

Ted