
Problem statement

I am trying to construct a model that predicts stock price volatility on a given day based on data points, represented as strings, that may or may not be present on that day. My hypothesis is that certain combinations of these data points correlate with different levels of volatility in stock price, but I don't know what those combinations are. Out of about 2100 unique potential data points, only 10-20 will be present on a given day. Therefore, I'm looking for a method that can visualize/display the rate of co-occurrence of a grab-bag of these data points, bucketed by volatility.

An example of my dataset looks roughly like this:

percentage_volatility_change_bucket,data_point,count
2,X a X,3438
2,X a Y,4056
2,X b X,3678
2,X b Y,6411
2,Y a Z,6503
2,Y b Z,6434
...
-5,X a X,3438
-5,X a Z,4056
-5,X b Z,3678
-5,X b Y,6411
-5,Y a X,6503
-5,Y b X,6434
  • percentage_volatility_change_bucket is one of a set of predefined values; it means that the % change in volatility on the day the data point was present fell between that value and the next higher (or, for loss days, the next lower) bucket value.
  • data_point is the name of the data point. Technically it is composed of three categories, but that's irrelevant here because I tally them as strings.
  • count is the number of times that data point was present on a day when a stock had that percentage change in volatility.
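For reference, a minimal sketch of how I load and reshape this data, assuming the file is named volatility_counts.csv (a hypothetical name) and using pandas:

```python
import pandas as pd

# Hypothetical file name for the aggregated counts shown above.
df = pd.read_csv("volatility_counts.csv")

# Reshape into one row per volatility bucket and one column per data
# point, with cells holding the counts (0 where a point never appeared).
matrix = df.pivot_table(
    index="percentage_volatility_change_bucket",
    columns="data_point",
    values="count",
    fill_value=0,
)
print(matrix.shape)  # roughly (number of buckets, ~2100)
```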

Ideally, the end result is an easy-to-consume visualization or display that shows, grouped by change in volatility, the groupings of data points that occurred together most frequently. There will be overlap between these groupings, which makes it harder for me to imagine how to visualize this.
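For concreteness, this is the kind of pairwise tally I'd like to visualize. Note that the aggregated counts above don't record which data points appeared together on the same day, so this sketch assumes access to the underlying per-day records (here a hypothetical list of (bucket, set-of-data-points) tuples):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical per-day records: (volatility bucket, data points present).
days = [
    (2, {"X a X", "X b Y", "Y a Z"}),
    (2, {"X a X", "X b Y"}),
    (-5, {"X a Z", "Y b X"}),
]

# Count, per bucket, how often each unordered pair of data points
# appeared together on the same day.
pair_counts = defaultdict(Counter)
for bucket, points in days:
    for a, b in combinations(sorted(points), 2):
        pair_counts[bucket][(a, b)] += 1

for bucket, counts in pair_counts.items():
    print(bucket, counts.most_common(3))
```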

What I've tried

My initial intuition is that this is a clustering problem. I've tried constructing a graph in which data points are connected to nodes representing buckets of days with a certain level of volatility, with edge weights equal to how many times that data point appeared. However, because of the number of nodes (2100 data points * n volatility-change buckets) and the fact that the same data points keep the same names across days, the result is a huge mass of nodes.
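Roughly, the construction I used looks like this sketch (networkx, with the node and edge details simplified):

```python
import networkx as nx

G = nx.Graph()

# One node per volatility bucket, one node per data point; the edge
# weight is the count of days that data point appeared in that bucket.
for bucket, data_point, count in [
    (2, "X a X", 3438),
    (2, "X a Y", 4056),
    (-5, "X a X", 3438),
]:
    G.add_edge(f"bucket:{bucket}", data_point, weight=count)

# With ~2100 data points shared across buckets, drawing this directly
# produces the "huge mass of nodes" described above.
print(G.number_of_nodes(), G.number_of_edges())
```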

I've also considered turning my graph into a Markov chain and generating output conditioned on a given high-volatility day, but this seems unreliable. I also believe there are different combinations of data points that can lead to high volatility, which means the same data points may not always be present, so a Markov chain could "wander" through the data points and fail to represent the significant component parts of these groupings.
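To illustrate the "wandering" concern, a minimal sketch of the Markov-chain idea over hypothetical per-day records for one bucket; the walk can drift between points that never actually occurred together on the same day:

```python
import random
from collections import defaultdict

# Hypothetical per-day records for one high-volatility bucket.
days = [
    {"X a X", "X b Y", "Y a Z"},
    {"X a X", "Y b Z"},
    {"X b Y", "Y a Z", "Y b Z"},
]

# Transition weights: from each data point to every point it
# co-occurred with, weighted by how often they co-occurred.
transitions = defaultdict(lambda: defaultdict(int))
for points in days:
    for a in points:
        for b in points:
            if a != b:
                transitions[a][b] += 1

# A weighted random walk over these transitions.
state = "X a X"
walk = [state]
for _ in range(5):
    successors = list(transitions[state])
    weights = [transitions[state][b] for b in successors]
    state = random.choices(successors, weights=weights)[0]
    walk.append(state)
print(walk)  # may chain together points from different days
```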

Other thoughts

I'm at a loss for how to visualize these combinations. The problem is made more difficult by the fact that on a given day there are at most 20 of these data points, and the large number of them overall (~2100) makes a matrix-style visualization tricky. I suspect there may be machine learning techniques that could help solve this problem, but I'm not familiar enough with the problem space to know what to apply. I also think there are probably more effective ways to conduct my clustering visualizations given my data. Any and all suggestions and pointers are welcome and appreciated.
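As one example of why the matrix view is tricky: even restricting to the most frequent data points, the matrix stays wide and sparse. A sketch of a truncated heatmap (matplotlib; same hypothetical file name as in the loading sketch above, and k is an arbitrary cutoff, not something from my data):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the bucket x data-point matrix from the loading sketch above.
df = pd.read_csv("volatility_counts.csv")
matrix = df.pivot_table(
    index="percentage_volatility_change_bucket",
    columns="data_point",
    values="count",
    fill_value=0,
)

# Keep only the k most frequent data points so the matrix is drawable.
k = 30
top_cols = matrix.sum(axis=0).nlargest(k).index
sub = matrix[top_cols]

fig, ax = plt.subplots(figsize=(12, 4))
im = ax.imshow(sub.values, aspect="auto")
ax.set_yticks(range(len(sub.index)))
ax.set_yticklabels(sub.index)
ax.set_xticks(range(k))
ax.set_xticklabels(top_cols, rotation=90, fontsize=6)
ax.set_xlabel("data_point")
ax.set_ylabel("volatility bucket")
fig.colorbar(im, ax=ax, label="count")
plt.tight_layout()
plt.show()
```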

