I am learning analytics online and have some quick questions.
Usually, when we do an analysis, why is it that we ignore the items/data points that are less frequent?
Let's say, for example, we have frequency data of drugs and the number of patients who consumed each drug. The data looks like the sample shown below, but in reality I might have millions of records.
From the above screenshot, we can see that whatever analysis and insights we derive from this data (including a few more columns that aren't shown here), we will definitely not consider Drug D.
What I am trying to do is manually map the drug names to the terms available in the dictionary (a data preparation task). As you can see in the screenshot, Drug A is mapped to ABCDE A. Similarly, I have to map all 50K drugs manually. My questions are:
a) The mapping cannot be automated, and I can't spend resources (money/people) to go through all 50K drugs manually and map them to dictionary terms. Hardly anyone is willing to do this job, it would be impossible to cover all 50K drugs, and paying reviewers to do so would cost too much. So I have to make sure that manual reviewers focus on the important (high-frequency) terms first. Is it even okay to ignore Drug D or Drug G, because they contribute very little to the data (considering the full dataset of millions of records)? The question is mainly about making this decision with a systematic/mathematical approach rather than my own judgment or visual inspection.
b) Hence, I am trying to find out whether there is an objective/systematic/mathematical approach that can tell me we can ignore all drugs below a certain N%, because I can't just say that, from visual inspection, I feel Drug G and Drug D can be ignored. If you suggest using a statistical significance test, can you please guide me on how to set this up as a problem? I usually see it used only in hypothesis testing.
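To make (b) concrete, here is a minimal sketch of the kind of rule I have in mind: rank the drugs by patient count and keep only the most frequent ones until they cover a chosen share of all patients. The counts are made up for illustration (they are not my real data), the function name `drugs_to_review` is hypothetical, and the 90% target is an arbitrary placeholder. What I am asking is how to justify that target objectively.

```python
# Hypothetical counts for illustration only; not the real dataset.
drug_counts = {
    "Drug A": 500,
    "Drug B": 300,
    "Drug C": 150,
    "Drug D": 5,
    "Drug E": 40,
    "Drug G": 5,
}


def drugs_to_review(counts, target_coverage):
    """Pick the most frequent drugs until their cumulative share of
    patients reaches target_coverage (a fraction between 0 and 1).

    Returns the selected drug names and the coverage actually reached.
    """
    total = sum(counts.values())
    # Most frequent drugs first.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    selected = []
    running = 0
    for name, n in ranked:
        if running / total >= target_coverage:
            break  # enough patients covered; the rest is the long tail
        selected.append(name)
        running += n
    return selected, running / total


# 0.90 is an arbitrary placeholder threshold, not a recommendation.
selected, coverage = drugs_to_review(drug_counts, 0.90)
print(selected, round(coverage, 3))
```

With these made-up numbers, only three of the six drugs would go to manual review, and Drug D and Drug G fall into the ignored tail. On the real data the same ranking would run over all 50K drugs.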