I am trying to learn about machine learning using Accord.Net.
I created a project with a number of labeled states that represent the screens a user visited, in sequence. I wrote some unit tests that submit a series of historical sequences plus a newly observed sequence, and calculate the probability that the new sequence fits.
My basic unit tests work fine, but once I expand them to my test data they throw errors. Each historical sequence contains 50-60 items and the observed sequence has 26 items. My method finds 30 distinct symbols (screens) in the dataset.
However, my final assertion, computed with "actual = _engine.CheckConfidence(history, observed)", throws an "Index out of Bounds" error inside the Learn() method, and it appears to be related to the symbols provided. I must be misunderstanding the usage here, but if there are 30 screens, then I should have 30 symbols and 30 potential states, correct?
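One thing that may be worth checking here is whether the screen labels are dense, 0-based integers. As far as I understand it, a discrete HiddenMarkovModel created with N symbols expects every observation to lie in 0..N-1, so 30 distinct labels only work directly if they are exactly 0 through 29. The sketch below uses only LINQ to remap arbitrary labels onto that range. SymbolEncoder, BuildCodebook and Encode are hypothetical names for illustration, not part of Accord.Net.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SymbolEncoder
{
    // Builds a lookup from each raw screen label to a dense index 0..N-1,
    // in order of first appearance (history first, then observations).
    public static Dictionary<int, int> BuildCodebook(int[][] historical, int[] observations)
    {
        var codebook = new Dictionary<int, int>();
        foreach (int label in historical.SelectMany(s => s).Concat(observations))
        {
            if (!codebook.ContainsKey(label))
                codebook.Add(label, codebook.Count);
        }
        return codebook;
    }

    // Replaces every raw label in a sequence with its dense index.
    public static int[] Encode(int[] sequence, Dictionary<int, int> codebook)
    {
        return sequence.Select(label => codebook[label]).ToArray();
    }
}
```

With labels such as 10, 42 and 99, the distinct count is 3 but the largest value is 99. After encoding they become 0, 1 and 2, and codebook.Count can be used as the symbol count.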
My implementation...
public int CheckConfidence(int[][] historical, int[] observations)
{
    // Collect the distinct symbols seen in the observations and the history.
    // Union already removes duplicates, so no extra Distinct() is needed.
    IEnumerable<int> symbols = observations.Distinct();
    foreach (int[] array in historical)
    {
        symbols = symbols.Union(array);
    }
    int symbolCount = symbols.Count();
    double probability = getLikelihood(historical, observations, symbolCount, symbolCount);
    // Calculate confidence on a percent scale (precision is defined elsewhere in the class)
    return (int)(Math.Round(probability, precision) * 100);
}

private double getLikelihood(int[][] historical, int[] observations, int states, int symbols)
{
    HiddenMarkovModel hmm = new HiddenMarkovModel(states, symbols);
    BaumWelchLearning teacher = new BaumWelchLearning(hmm) { Tolerance = 0.001 };
    teacher.Learn(historical);
    // LogLikelihood returns a log-probability, so exponentiate to get a probability
    return Math.Exp(hmm.LogLikelihood(observations));
}
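A note on scale: the likelihood of a whole sequence is a product of per-step probabilities, so it shrinks quickly with length. Even if each of 26 steps had probability 0.9, the product would be about 0.065, which is nowhere near 100 on a percent scale. The sketch below shows one possible alternative, a length-normalized value exp(logL / n), which is the geometric mean of the per-observation probabilities. This is a swapped-in normalization for illustration, not something Accord.Net does for you, and LikelihoodScale and PerObservation are hypothetical names.

```csharp
using System;

public static class LikelihoodScale
{
    // Geometric mean of the per-observation probabilities: exp(logL / n).
    // logLikelihood is the log-likelihood of the whole sequence,
    // length is the number of observations in that sequence.
    public static double PerObservation(double logLikelihood, int length)
    {
        if (length <= 0)
            throw new ArgumentOutOfRangeException(nameof(length));
        return Math.Exp(logLikelihood / length);
    }
}
```

For example, with logL = 26 * Math.Log(0.9), Math.Exp(logL) is roughly 0.065 while PerObservation(logL, 26) is 0.9 up to floating-point error.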
...and my unit tests.
public void GetProbabilityTest()
{
    try
    {
        // Test the perfect case
        int[][] history = new int[][]
        {
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
        };
        int[] observed = new int[] { 0, 1, 0, 1 };
        double actual = _engine.CheckConfidence(history, observed);
        Assert.AreEqual(100, actual); // 100%

        // Test the perfectly WRONG case
        history = new int[][]
        {
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
            new int[] { 0, 1, 0, 1 },
        };
        observed = new int[] { 2, 2, 2, 2 };
        actual = _engine.CheckConfidence(history, observed);
        Assert.AreEqual(0, actual); // 0%

        // Do it again with real numbers
        history = _engine.GetHistoricalData(_historicalData.ToList());
        observed = _engine.GetNewObservations(_newObservations.ToList());
        IEnumerable<int> symbols = observed.Distinct();
        foreach (int[] array in history)
        {
            symbols = symbols.Union(array.Distinct()).Distinct();
        }
        actual = _engine.CheckConfidence(history, observed);
        Assert.AreEqual(0, actual); // 0%
    }
    catch (Exception ex)
    {
        Assert.Fail("Exception caught: " + ex.Message);
    }
}
StackTrace...
at Accord.Statistics.Distributions.Univariate.GeneralDiscreteDistribution.Fit(Int32[] observations, Double[] weights, GeneralDiscreteOptions options)
at Accord.Statistics.Distributions.Univariate.GeneralDiscreteDistribution.Fit(Int32[] observations, Double[] weights)
at Accord.Statistics.Models.Markov.Learning.BaseBaumWelchLearning`4.Fit(Int32 index, TObservation[] values, Double[] weights)
at Accord.Statistics.Models.Markov.Learning.BaseBaumWelchLearning`4.UpdateEmissions()
at Accord.Statistics.Models.Markov.Learning.BaseBaumWelchLearning`4.Learn(TObservation[][] x, Double[] weights)
at AnomalyDetector.Engines.BehaviorEngine.getLikelihood(Int32[][] historical, Int32[] observations, Int32 states, Int32 symbols) in C:\Git\anomalydetector\AnomolyDetector.Core\Engines\BehaviorEngine.cs:line 134
EDIT: So I have figured out through unit testing that it begins to error out at 20 symbols. I haven't been able to get Accord.Net to build locally yet, but looking at the code I don't see why this would be constrained at 20.
EDIT: Still struggling to figure out if I'm providing the right input or this is an Accord bug.
for (int i = 0; i < observations.Length; i++)
    p[observations[i] - start] += weights[i] * observations.Length;
I got it building locally, and the error is in Accord.Statistics\Distributions\Univariate\Discrete\GeneralDiscreteDistribution.Fit(int[] observations, double[] weights, GeneralDiscreteOptions options), in the loop shown above. There, observations has length 150 while p has length 29, so when the loop reaches index = 29 it blows up. I was under the impression that symbols should be the number of unique (distinct) symbols in the set, i.e. the size of the alphabet. Setting symbols to 150 does avoid this error, but then my results are useless: I never seem to get a likelihood result that is positive, even when passing in one of the historical arrays as the new observations.
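Reading the quoted loop again, p is indexed by observations[i] - start, that is, by the symbol's value rather than by the loop counter i. If that reading is right, the length of observations (150) is not what overflows p. A single observation whose value is 29 or more would. The sketch below is a hypothetical stand-in for that loop plus a small range check. FitIndexDemo and its methods are made-up names, start is assumed to be 0, and this is not Accord's actual code.

```csharp
using System;

public static class FitIndexDemo
{
    // Stand-in for the quoted loop: p has one slot per symbol and is
    // indexed by the VALUE of each observation, not by its position.
    public static double[] Accumulate(int[] observations, double[] weights, int symbols, int start = 0)
    {
        double[] p = new double[symbols];
        for (int i = 0; i < observations.Length; i++)
            p[observations[i] - start] += weights[i] * observations.Length;
        return p;
    }

    // Returns true when every observation value fits a table with
    // the given number of symbols, i.e. lies in start..start+symbols-1.
    public static bool FitsAlphabet(int[] observations, int symbols, int start = 0)
    {
        foreach (int o in observations)
        {
            int index = o - start;
            if (index < 0 || index >= symbols)
                return false;
        }
        return true;
    }
}
```

Under these assumptions, a long sequence whose values all lie in 0..28 accumulates fine with symbols = 29, while a one-item sequence containing the value 29 throws IndexOutOfRangeException. Running FitsAlphabet over the history before calling Learn() would show whether any label falls outside the expected range.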