I am trying to figure out how best to encode ICD-10 codes as input to a machine learning model.
The codes aren't ordinal by any means. However, the labels alone follow a logic that tells you which codes can be grouped together, without any additional knowledge. In some ways they are closer to an interval data type, in that the difference between consecutive codes frequently (but by no means always) has some kind of meaning.
This makes me think that binary encoding may be more effective than one-hot encoding, but are there other options I should be considering?
For example:
I71.1 Thoracic aortic aneurysm, ruptured
I71.2 Thoracic aortic aneurysm, without mention of rupture
I71.3 Abdominal aortic aneurysm, ruptured
I71.4 Abdominal aortic aneurysm, without mention of rupture
J12.2 Parainfluenza virus pneumonia
J12.3 Human metapneumovirus pneumonia
J12.8 Other viral pneumonia
J12.9 Viral pneumonia, unspecified
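
To make the grouping idea concrete, here is a rough sketch of one alternative: splitting each code into its hierarchy levels so that each level can be encoded separately. The helper name `split_icd10` and the regular expression are my own assumptions, not part of any library, and the pattern only covers codes shaped like the ones above (a letter, two further characters, then an optional part after the dot).

```python
import re

# Assumed shape: letter, digit, digit-or-letter, optional ".xxxx" suffix.
# This is a simplification and may not cover every ICD-10 variant.
ICD10_PATTERN = re.compile(r"^([A-Z])(\d[0-9A-Z])(?:\.([0-9A-Z]{1,4}))?$")

def split_icd10(code):
    """Split a code such as 'I71.3' into coarse-to-fine hierarchy levels."""
    match = ICD10_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised ICD-10 code: {code!r}")
    letter, category_rest, subcategory = match.groups()
    return {
        "letter": letter,                      # e.g. "I"
        "category": letter + category_rest,    # e.g. "I71"
        "subcategory": subcategory or "",      # e.g. "3", empty if absent
    }

codes = ["I71.1", "I71.3", "J12.2", "J12.9"]
features = [split_icd10(c) for c in codes]
```

With this split, I71.1 and I71.3 share the same `category` value while J12.2 differs at the `letter` level, so each level could be one-hot encoded (or embedded) on its own instead of treating every full code as an unrelated label.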