
I am learning analytics online and have some quick questions.

Usually when we do analysis, why is it that we ignore the less frequent items/data points?

Let's say, for example, we have frequency data of drugs and the number of patients who consumed each drug. The data looks like the example shown below, but in practice I might have millions of records.

[screenshot: table of drug names with patient counts and mapped dictionary terms]

From the above screenshot, we can see that in whatever analysis and insights we derive from this data (including a few more columns not shown here), we will definitely not consider Drug D.

What I am trying to do is manually map the drug names to the terms available in the dictionary (a data preparation task). As you can see in the screenshot, Drug A is mapped to ABCDE A. Similarly, I have to manually map all 50K drugs. My questions are:

a) I can't spend the resources (money/people) needed to go through all 50K drugs manually and map them to dictionary terms (it cannot be automated). No one is interested in doing this job, and even for those who are, covering all 50K drugs would be impractical and would cost too much. So I have to make sure the manual reviewers focus on the important (high-frequency) terms first. Is it then acceptable to ignore Drug D or Drug G, since they contribute very little value to the data (considering the full dataset of millions of records)? My question is mainly about making this decision through a systematic/mathematical approach rather than my own judgment, visual inspection, or subjective impression.

b) Hence, I am trying to find out whether there is any objective/systematic/mathematical approach that can tell me we can ignore all drugs below a certain N%, etc. I can't just say that, through visual inspection, I feel Drug G and Drug D can be ignored. If you are suggesting a statistical significance test, can you please guide me on how to set this up as a problem? I usually see such tests used only in hypothesis testing. Can I kindly request your guidance on this?
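For concreteness, one purely mechanical rule of the kind asked about in (b) is a cumulative-coverage (Pareto-style) cutoff: sort the drugs by frequency and keep mapping until the mapped drugs account for, say, 95% of all patient records. A minimal sketch in Python/pandas, using a hypothetical toy table (the drug names and counts below are made up, standing in for the real 50K-row data):

```python
import pandas as pd

# Toy frequency table (hypothetical counts, not the real data).
df = pd.DataFrame({
    "drug_name": ["Drug A", "Drug B", "Drug C", "Drug D", "Drug G"],
    "n_patients": [12000, 8000, 3000, 5, 2],
})

# Sort by frequency and compute each drug's cumulative share of all records.
df = df.sort_values("n_patients", ascending=False).reset_index(drop=True)
df["cum_share"] = df["n_patients"].cumsum() / df["n_patients"].sum()

coverage_target = 0.95  # map enough drugs to cover 95% of patient records

# Keep a drug if the drugs ranked above it do not yet reach the target.
to_map = df[df["cum_share"].shift(fill_value=0) < coverage_target]
print(to_map[["drug_name", "n_patients", "cum_share"]])
```

With these toy numbers, Drugs A–C alone cover well over 95% of records, so Drug D and Drug G fall below the cutoff and would be deferred rather than mapped first. The 95% target is itself a judgment call, but once chosen, the rule is applied identically to every drug, which is the "systematic" property the question asks for.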

The Great
  • It is not true in general that infrequent observations "have no influence" on results; a single extreme outlier can impact the results obtained from many other valid samples. – Tim Jun 13 '20 at 15:45
  • Hi @Tim - I updated the screenshot and post with a little more info. Hope this helps you understand the problem I am facing. – The Great Jun 14 '20 at 00:35
  • I'm not clear on what the purpose of this work is. Why are you mapping drugs to dictionary terms and what value does this create? How is the value related to frequency? – Ryan Volpi Jun 14 '20 at 03:04
  • Okay, the `drug_name` column has values which are local names specific to our country, but now we are trying to map them to a universal standard... The same drug can have different names in different countries due to brand, manufacturer, etc. So now we standardize it all to a universal common standard (`common data model`). Following this common standard can help in federated research and analysis. – The Great Jun 14 '20 at 03:10
  • Since this is manual work, we want to make sure that we map at least the most frequently occurring items/drugs... For example, if I miss `Drug G` or `Drug D`, it may not really make any difference, because we can't arrive at any conclusion about drug treatment response etc. when the drug is used by only a very few people. – The Great Jun 14 '20 at 03:12
  • Think about using regularization---if you have a factor with very many levels, see https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Feb 04 '21 at 13:55

0 Answers