18

I'm looking for some robust techniques to remove outliers and errors (whatever the cause) from financial time-series data (i.e. tickdata).

Tick-by-tick financial time-series data is very messy. It contains huge (time) gaps when the exchange is closed, and shows huge jumps when the exchange opens again. When the exchange is open, all kinds of factors introduce trades at price levels that are wrong (they did not occur) and/or not representative of the market (for example, a spike caused by an incorrectly entered bid or ask price). This paper by tickdata.com (PDF) does a good job of outlining the problem, but offers few concrete solutions.

Most papers I can find online that mention this problem either ignore it (the tickdata is assumed filtered) or include the filtering as part of some huge trading model which hides any useful filtering steps.

Is anybody aware of more in-depth work in this area?

Update: this question seems similar on the surface, but:

  • Financial time series is (at least at the tick level) non-periodic.
  • The opening effect is a big issue because you can't simply use the last day's data as initialisation even though you'd really like to (because otherwise you have nothing). External events might cause the new day's opening to differ dramatically both in absolute level, and in volatility from the previous day.
  • Wildly irregular frequency of incoming data. Near open and close of the day the amount of datapoints/second can be 10 times higher than the average during the day. The other question deals with regularly sampled data.
  • The "outliers" in financial data exhibit some specific patterns that could be detected with specific techniques not applicable in other domains and I'm -in part- looking for those specific techniques.
  • In more extreme cases (e.g. the flash crash) the outliers might amount to more than 75% of the data over longer intervals (> 10 minutes). In addition, the (high) frequency of incoming data contains some information about the outlier aspect of the situation.
jilles de wit
  • I don't think this is a duplicate because of the nature of the data. The problem discussed on the other question concerned regularly observed time series with occasional outliers (at least that's how I interpreted it). The nature of tick-by-tick data would lead to different solutions due to the exchange opening effect. – Rob Hyndman Aug 04 '10 at 10:09
  • possible duplicate of [Simple algorithm for online outlier detection of a generic time series](http://stats.stackexchange.com/questions/1142/simple-algorithm-for-online-outlier-detection-of-a-generic-time-series) This question is proposed to be closed as a duplicate. Could you please let us know at the meta thread if and how your context is different from the question I linked? –  Aug 04 '10 at 10:10
  • @Rob But the exchange opening effect only determines when you have to run the algorithm. The fundamental issue remains the same. Even in network data you have the 'office opening effect' where traffic peaks as soon as an office opens. At the very least, the OP should link to that question, scan the answers there and explain why the solutions there do not work so that a suitable answer can be posted for this question. –  Aug 04 '10 at 10:12
  • I agree with @Rob. This kind of data can pose unique challenges, so this is not a duplicate. – Shane Aug 04 '10 at 10:46
  • This question might eventually get served better here due to its domain-specificity: http://area51.stackexchange.com/proposals/117/quantitative-finance – Shane Aug 04 '10 at 11:09
  • I think it belongs here. The question is about analyzing irregularly spaced, very noisy time series. Have you had a look at "An Introduction to High-Frequency Finance" by Dacorogna, Olsen and a bunch of others? Or the papers by the same authors? – PeterR Aug 04 '10 at 12:30
  • I saw the other answer and followed Rob's reasoning. I amended my question to address differences I see. – jilles de wit Aug 04 '10 at 13:18
  • @jilles I do not see any edits to your question. Did you save your edits? It may help if you post a link to that question as well and indicate the changes to your question by something like 'Edit'. –  Aug 04 '10 at 13:39
  • I have the Olsen book and it doesn't address the exchange open/close question. – Shane Aug 04 '10 at 13:39
  • @Srikant: done, @PeterR: do you know of any specific paper of those authors that addresses this question? – jilles de wit Aug 04 '10 at 13:46
  • I wish there were a way to undo my close vote! I think it is clear now that it is not a duplicate. –  Aug 04 '10 at 13:49
  • [Found this related article that describes a multi-stage algorithm:](https://link.springer.com/article/10.1057/jdhf.2009.16) – Aharon Z. Jan 11 '22 at 16:19

3 Answers

14

The problem is definitely hard.

Mechanical rules like +/- N1 times the standard deviation, +/- N2 times the MAD, +/- N3 times the IQR, or ... will fail, because there are always some series that behave differently, for example (a minimal sketch of such a rule follows this list):

  • fixings such as interbank rates may be constant for some time and then suddenly jump
  • similarly for e.g. certain foreign exchange rates coming off a peg
  • certain instruments are implicitly spreads; these may be near zero for long stretches and then suddenly jump many-fold
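
To make this concrete, here is a minimal Python sketch of such a mechanical rule (the threshold and the MAD-based scale are arbitrary choices, not recommendations); note how it degenerates on the constant-fixing case above, where the scale estimate collapses to zero:

```python
import numpy as np

def mad_rule(prices, n_mad=5.0):
    """Naive 'mechanical' rule: flag prices more than n_mad robust deviations
    from the median. Threshold and scale estimator are arbitrary choices."""
    p = np.asarray(prices, dtype=float)
    med = np.median(p)
    scale = 1.4826 * np.median(np.abs(p - med))   # MAD, scaled for normal data
    if scale == 0.0:                              # e.g. a fixing constant for weeks:
        return np.zeros(len(p), dtype=bool)       # the rule degenerates and flags nothing
    return np.abs(p - med) > n_mad * scale        # True = flagged as an outlier
```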

Been there, done that, ... in a previous job. You could try to bracket each series using arbitrage relationships (e.g. if USD/EUR and EUR/JPY are presumed good, you can work out bands around what USD/JPY should be); likewise for derivatives off an underlying, etc.
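
As a rough illustration of that bracketing idea (a sketch only; the band width, quote conventions and function names are my own assumptions, not something we formalised):

```python
def cross_rate_band(usd_eur, eur_jpy, width_bp=20.0):
    """Triangular no-arbitrage bracket: given trusted USD/EUR and EUR/JPY quotes
    (read base/quote), the implied cross bounds where USD/JPY should trade.
    The band width in basis points is an assumption, not a calibrated value."""
    implied = usd_eur * eur_jpy               # EUR per USD * JPY per EUR = JPY per USD
    half = implied * width_bp / 1e4
    return implied - half, implied + half

def suspect_cross_tick(usd_jpy, usd_eur, eur_jpy, width_bp=20.0):
    lo, hi = cross_rate_band(usd_eur, eur_jpy, width_bp)
    return not (lo <= usd_jpy <= hi)          # outside the implied band -> flag for review
```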

Commercial data vendors expend some effort on this, and those of us who are clients of theirs know ... it still does not exclude errors.

Dirk Eddelbuettel
  • +1 yes, nothing is perfect. Tickdata.com (whose paper is mentioned) also includes outliers, and they also strip out too much good data (when compared with another source). Olsen's data is close to being terrible, and is generally just indicative. There's a reason that banks pay big operations teams to work on this. – Shane Aug 04 '10 at 16:10
  • I like your idea about using known arbitrage relations. have you tried this at all in your previous job? – jilles de wit Aug 05 '10 at 07:27
  • No, we never fully formalized that. But I think we used some simple ones (ie ETF vs underlying index etc). It's been a few years though. – Dirk Eddelbuettel Aug 05 '10 at 11:43
8

I'll add some paper references when I'm back at a computer, but here are some simple suggestions:

Definitely start by working with returns. This is critical to deal with the irregular spacing, where you can naturally get big price gaps (especially around weekends). Then you can apply a simple filter to remove returns well outside the norm (e.g. beyond a high number of standard deviations). The returns will adjust to the new absolute level, so large real changes will result in the loss of only one tick. I suggest using a two-pass filter with returns taken from 1 step and n steps to deal with clusters of outliers.
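
A minimal sketch of that kind of filter, assuming log returns and a robust scale estimate; the thresholds n_sd and n_step and the way the two passes are combined are assumptions, not recommended values:

```python
import numpy as np

def flag_return_outliers(prices, n_sd=6.0, n_step=5):
    """Two-pass return filter: flag ticks whose 1-step or n-step log return sits
    far outside a robust dispersion estimate. The n-step pass is there to catch
    clusters of bad ticks that look tame one step at a time."""
    p = np.log(np.asarray(prices, dtype=float))
    r1 = np.diff(p, prepend=p[0])             # pass 1: tick-to-tick log returns
    rn = p - np.roll(p, n_step)               # pass 2: n-step log returns
    rn[:n_step] = 0.0                         # undefined at the start of the series
    def robust_scale(r):
        return 1.4826 * np.median(np.abs(r - np.median(r)))
    flag1 = np.abs(r1) > n_sd * robust_scale(r1)
    flagn = np.abs(rn) > n_sd * robust_scale(rn)
    return flag1 | flagn                      # suspect if either horizon is extreme
```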

Edit 1: Regarding the use of prices rather than returns: asset prices tend not to be stationary, so IMO that can pose some additional challenges. To account for the irregularity and power-law effects, I would advise some kind of adjustment if you want to include them in your filter. You can scale the price changes by the time interval or by volatility. You can refer to the "realized volatility" literature for some discussion of this. Also discussed in Dacorogna et al.

To account for the changes in volatility, you might try basing your volatility calculation from the same time of the day over the past week (using the seasonality).
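
For example (a sketch only, assuming a pandas DatetimeIndex; the bucket size, lookback window and use of a median absolute return are assumptions):

```python
import pandas as pd

def intraday_scale(returns: pd.Series, freq="5min", lookback=pd.Timedelta("7D")):
    """Robust per-bucket return scale estimated from roughly the past week,
    using the time of day as the seasonal key."""
    recent = returns[returns.index >= returns.index.max() - lookback]
    bucket = recent.index.floor(freq).time            # time-of-day label per tick
    return recent.abs().groupby(bucket).median()      # one scale per intraday bucket

def standardise(returns: pd.Series, profile: pd.Series, freq="5min"):
    """Divide each return by the scale of its own time-of-day bucket before
    thresholding, so quiet and busy hours are judged on their own terms."""
    keys = pd.Series(returns.index.floor(freq).time, index=returns.index)
    return returns / keys.map(profile)
```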

Shane
  • By using only the returns you become very vulnerable to ladders (i.e. a sequence of prices that climbs or drops away from the norm, where each individual return is acceptable, but as a group they represent an outlier). Ideally you'd use both the return and the absolute level. – jilles de wit Aug 04 '10 at 14:01
5

I have (with some delay) changed my answer to reflect your concern about the lack of 'adaptability' of the unconditional MAD/median.

You can address the problem of time-varying volatility within the robust statistics framework. This is done by using a robust estimator of the conditional variance (instead of the robust estimator of the unconditional variance I was suggesting earlier): the M-estimation of the GARCH model. Then you will have robust, time-varying estimates $(\hat{\mu}_t,\hat{\sigma}_t)$ which are not the same as those produced by the usual GARCH fit. In particular, they are not driven by a few far-away outliers. Because these estimates are not driven by them, you can use them to reliably flag the outliers using the historical distribution of

$$\frac{x_t-\hat{\mu}_t}{\hat{\sigma}_t}$$

You can find more information (and a link to an R package) in this paper:

Boudt, K. and Croux, C. (2010). Robust M-Estimation of Multivariate GARCH Models.
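
As a sketch of the flagging step (assuming you already have the robust conditional estimates $(\hat{\mu}_t,\hat{\sigma}_t)$, e.g. from the R package referenced in the paper; the tail quantile is an arbitrary choice):

```python
import numpy as np

def flag_standardised_residuals(x, mu_hat, sigma_hat, q=0.999):
    """Given robust conditional mean and volatility estimates, flag observations
    whose standardised residual falls in the extreme tail of its own
    historical distribution. The quantile q is an arbitrary choice."""
    z = (np.asarray(x, float) - np.asarray(mu_hat, float)) / np.asarray(sigma_hat, float)
    cutoff = np.quantile(np.abs(z), q)
    return np.abs(z) > cutoff
```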

user603
  • I've tried something like this, but this method is not very good at dealing with abrupt changes in the volatility. This leads to underfiltering in quiet periods and overfiltering during more busy times. – jilles de wit Aug 04 '10 at 13:53
  • I do not understand this "This leads to underfiltering in quiet periods and overfiltering during more busy times" care to explain ? – user603 Aug 06 '10 at 16:25
  • In quiet periods price volatility tends to be lower, so prices closer to the mean can be considered outliers. However, because you use MAD for (presumably) an entire trading day (or even longer) these outliers are less than 3 MAD away from the median and will not be filtered. The reverse is true for busy periods with higher price movements (acceptable price movements will be filtered). Thus the problem reduces to properly estimating the MAD at all times, which is the issue to begin with. – jilles de wit Sep 14 '10 at 13:39