
I am trying to get a start on a clustering problem. The sample data is trade volume at a particular price. Some notes about the data:

  • the number of bins varies from sample to sample (some samples cover a larger price range)
  • volumes vary (high- vs. low-volume days)
  • the range of price indices varies from sample to sample (the market is in a different place at different times)

With that said, I would like to find clusters based on the general shapes of the histograms (tall/thin, short/wide, skewed toward the top/bottom, two humps, etc.).

Could someone give me some pointers on how to prepare the data and what algorithms might be particularly suited to this type of analysis?

Here are a couple of samples from the data:

In [111]: profiles[100]
Out[111]: 
             0
1235.75    802
1236.00    802
1236.25    802
1236.50    802
1236.75   3410
1237.00   7452
1237.25   7452
1237.50   8324
1237.75   8607
1238.00   8607
1238.25  11294
1238.50  11190
1238.75   9178
1239.00   8668
1239.25   9710
1239.50  11036
1239.75  10909
1240.00   9597
1240.25   7295
1240.50   7295
1240.75   7594
1241.00   5018
1241.25   4398
1241.50   3766
1241.75   2875
1242.00   2476
1242.25   2111
1242.50    893
1242.75    893
1243.00    893
...        ...
1269.00  29895
1269.25  27924
1269.50  27170
1269.75  21205
1270.00  19460
1270.25  20509
1270.50  19763
1270.75  20707
1271.00  21122
1271.25  20498
1271.50  23487
1271.75  24899
1272.00  23027
1272.25  24805
1272.50  27185
1272.75  29477
1273.00  26555
1273.25  26665
1273.50  25465
1273.75  20654
1274.00  17710
1274.25  17224
1274.50  15067
1274.75  11654
1275.00   9127
1275.25   7968
1275.50   6950
1275.75   5765
1276.00   3924
1276.25   1358

[163 rows x 1 columns]

In [115]: profiles[203]
Out[115]: 
             0
1256.25   2709
1256.50   2709
1256.75   4887
1257.00   4887
1257.25   7341
1257.50   7341
1257.75  10523
1258.00   9471
1258.25  10787
1258.50   9989
1258.75   8939
1259.00   7918
1259.25   6594
1259.50   3219
1259.75   2483
1260.00   1903
1260.25   1118
1260.50   2861
1260.75   2861
1261.00   4663
1261.25   5059
1261.50   6833
1261.75   8940
1262.00  10070
1262.25   7573
1262.50   8746
1262.75   7811
1263.00   4579
1263.25   9609
1263.50  10623
...        ...
1270.25  21425
1270.50  20549
1270.75  19323
1271.00  23254
1271.25  28894
1271.50  29643
1271.75  27828
1272.00  31662
1272.25  29758
1272.50  30038
1272.75  35955
1273.00  39926
1273.25  36257
1273.50  42088
1273.75  41592
1274.00  34771
1274.25  20096
1274.50  15772
1274.75  15252
1275.00  11450
1275.25  17206
1275.50  15412
1275.75  21349
1276.00  19263
1276.25  14408
1276.50  12383
1276.75   4440
1277.00   3524
1277.25   1159
1277.50   1159

[86 rows x 1 columns]
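
For reference, here is a minimal sketch of one possible way to put profiles like these on a common footing before clustering. It is not a tested recipe for this data. It assumes each profile is a 1-D array of volumes ordered by price, with evenly spaced price levels and no gaps (as in the 0.25-tick samples above). The names `normalize_profile`, `emd_1d`, and `n_bins` are hypothetical. The idea is to divide by total volume (removing the high/low volume effect), map the price range onto [0, 1] (removing the price level), and resample onto a fixed number of bins (removing the differing bin counts).

```python
import numpy as np

def normalize_profile(volumes, n_bins=32):
    """Resample one volume-at-price profile to a unit-sum vector of length n_bins.

    Assumes `volumes` is ordered by price and the price levels are evenly spaced.
    """
    volumes = np.asarray(volumes, dtype=float)
    total = volumes.sum()
    if total <= 0:
        raise ValueError("profile has no volume")
    # Cumulative share of volume at each original bin edge, on a [0, 1] price axis.
    cdf = np.concatenate([[0.0], np.cumsum(volumes)]) / total
    edges = np.linspace(0.0, 1.0, len(volumes) + 1)
    # Interpolate the cumulative curve at the new bin edges, then difference it
    # to get the share of volume falling in each new bin.
    grid = np.linspace(0.0, 1.0, n_bins + 1)
    return np.diff(np.interp(grid, edges, cdf))

def emd_1d(p, q):
    """Approximate 1-D earth mover's distance between two unit-sum vectors
    defined on the same equal-width grid over [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum() / len(p)
```

With the DataFrames shown above, something like `normalize_profile(profiles[100][0].to_numpy())` would produce one fixed-length vector per sample (assuming the volume column is named `0`, as in the output). The resulting vectors, or a pairwise distance matrix built with `emd_1d`, could then be passed to a standard clustering method. Note that rescaling the price axis discards the absolute width of each profile, so a wide-versus-narrow distinction would need to be kept as a separate feature.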
    There are many similar questions (with answers) on this site, for example: https://stats.stackexchange.com/questions/152519/simple-way-to-cluster-histograms, https://stats.stackexchange.com/questions/353416/cluster-daily-profiles-of-energy-consumption, https://stats.stackexchange.com/questions/25764/clustering-distributions. However, your requirement that the binning may differ between samples is not covered there, so this is not a duplicate. – kjetil b halvorsen Apr 05 '20 at 16:07
