I am a doctor so please be kind with me and my basic understanding of statistics.

I have a dataset consisting of patients and their visits and I have labelled the presence of a specific kind of mole in their left and/or right hand with {0,1} values (0 = not present and 1 = present). The dataset looks like this:

(I removed the dataset since answers have been provided; I can send it again on request.)

So, that means that patient A1-001 had 6 visits, with no mole present in his right hand during any visit and a mole present in his left hand in all visits except the first one.

I am interested in finding the probability of a hand developing a mole among only the patients that developed a mole in one hand and finding the probability of developing a mole in the other hand (given that the patient had already a mole in the other hand).

Furthermore, I want to know the probability of developing a mole within visits among the patients who developed a mole at some point in both hands.

Could you help me model these simple questions?

azal
  • "Furthermore, I want to know what is the probability of developing a mole in the same visit among the patients that developed a mole at some point in both hands." — But you're coding moles per hand as present or absent rather than counting the number of moles, so if a patient already has a mole on each hand, how would an additional mole be visible in the data? – Kodiologist Jun 15 '18 at 15:27
  • @Kodiologist Yes, I am only interested in the presence and not the number of moles. If a patient has already a mole on each hand, then it is not possible to have an extra one: it is only possible to stay with this mole or the mole to disappear. – azal Jun 15 '18 at 15:34
  • Providing the complete data set may clarify the question and assist with receiving an answer. – Todd D Jun 15 '18 at 17:44
  • @Todd I don't get how providing the whole dataset will change the solution to the problem. I'm not a mathematician but I believe the problem is well-defined even with this sample size. I guess solving the problem for N = 3 (number of patients) will be the same as solving the problem for N = 100. – azal Jun 18 '18 at 14:47
  • To be honest, I did not imagine that it's gonna be so difficult even for mathematicians ^_^ – azal Jun 18 '18 at 15:17
  • @laza, mathematics is not difficult for mathematicians. But you pose a problem that is not about mathematics and instead about trying to understand what you mean (that is why a larger data-set was asked)... what do you mean by *"I want to know what is the probability of developing a mole in the same visit among the patients that developed a mole at some point in both hands."* ? You did not answer that question from Kodiologist. – Sextus Empiricus Jun 20 '18 at 14:42
  • @MartijnWeterings I uploaded an extended dataset as requested. The question from Kodiologist is answered I think :/ – azal Jun 20 '18 at 14:54
  • @laza you explained something about presence instead of number, but you did not explain the sentence (which is confusing in the case that you are only counting presence and not number). – Sextus Empiricus Jun 20 '18 at 14:56
  • What is 'the same visit' and what do you mean exactly by ' the patients that developed a mole at some point in both hands' (this seems simple but there is some ambiguity)? – Sextus Empiricus Jun 20 '18 at 14:57
  • How do moles develop? How long do they last? Are we looking at the same mole if a mole is present at two visits? Currently we can only know the rate of presence, but not so much the probability of developing (which is currently a bit undefined, and if it is gonna be related to a time period then the data, without time stamps, is not gonna be sufficient). – Sextus Empiricus Jun 20 '18 at 15:00
  • The most important question is: What is the probability of a patient developing a mole in one hand given that at some point he has had (or currently has) a mole in his other hand? I'm trying to figure out if the appearance of a mole in one hand is related to one having appeared in the other. – azal Jun 20 '18 at 15:18
  • The time stamps are the patient ID: when you see the repetition, it means it is a different visit (always forward in time) – azal Jun 20 '18 at 15:19
  • laza, your model would improve if the time stamp is actually a 'time' or 'date' and not just the order of visits. How else can the time difference between two visits be determined? It is also very important to know what the reasons are for the people to visit. Are these regular visits, or are patients more likely to show up when they detect a mole? You need data for which it is clearly described how it has been gathered. For instance, if the data is from a mole detection center then you are gonna detect a lot of moles among patients. But that does not mean an extremely high probability of getting moles. – Sextus Empiricus Jun 21 '18 at 16:02
  • @MartijnWeterings I added the time stamps – azal Jun 21 '18 at 16:11
  • As an aside: it would be wise to anonymize the datestamps (e.g. add a random number of days to each), as in principle this data could be used with other data sets (e.g. loyalty card transactions in the bakery opposite the surgery) to identify the patients. In practice it would be unreasonably hard, of course. (And maybe you already have, or maybe this is artificial data.) – Darren Cook Jun 21 '18 at 19:57
  • Does B2-126 (mole on RH, then cured, then mole on LH) count towards your first question ("finding the probability of developing a mole in the other hand (given that the patient had already a mole in the other hand).") ? – Darren Cook Jun 21 '18 at 20:02
  • @DarrenCook Yes, it doesn't have to be consecutive. – azal Jun 22 '18 at 09:44
  • @DarrenCook I have randomly changed the dates within the same range of visits – azal Jun 26 '18 at 16:19

3 Answers

I personally feel this lends itself well to a survival analysis.

You have people without moles in a certain hand at the start of the period (your at-risk population); you can select these, and you have time points for follow-up and whether they experienced the event (developed a mole) or were censored (last seen without one). This gives you a hazard for whatever cohort you've selected.

You can then calculate a hazard ratio (e.g. for developing a right-hand mole in people with a left-hand mole at baseline, versus those without). This could be shown on a Kaplan-Meier graph and will come with a confidence interval.
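
To make the idea concrete, here is a minimal sketch of a Kaplan-Meier estimate in plain Python, comparing two cohorts. The follow-up times and event flags are hypothetical and are not taken from the question's dataset, and the function name `kaplan_meier` is just an illustrative choice. It only produces the survival curves; a hazard ratio with a confidence interval would normally come from a fitted model (e.g. Cox regression) in a statistics package.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (e.g. days since first mole-free visit).
    events: 1 if the mole appeared at that time, 0 if the patient was censored.
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # every patient leaves the risk set at their T
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who stay mole-free at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Hypothetical data (days), for illustration only.
T_no_left_mole = [30, 60, 90, 120, 150]
E_no_left_mole = [0, 1, 0, 0, 1]
T_left_mole = [20, 40, 60, 80, 100]
E_left_mole = [1, 1, 0, 1, 1]

km_without = kaplan_meier(T_no_left_mole, E_no_left_mole)
km_with = kaplan_meier(T_left_mole, E_left_mole)
print("no left-hand mole at baseline:", km_without)
print("left-hand mole at baseline:   ", km_with)
```

Here "survival" means remaining free of a right-hand mole; a curve that drops faster in the second cohort would suggest a higher hazard in patients who already have a left-hand mole.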

James
  • Hi @James, I think I will give a shot to this https://lifelines.readthedocs.io/en/latest/Quickstart.html#kaplan-meier-and-nelson-aalen What do you think? – azal Jun 22 '18 at 10:18
  • I’m sure that’s fine. Whilst I love python I generally prefer R for stats, but this seems reasonably well supported. – James Jun 22 '18 at 10:41
  • can you give me a hint or two wrt bringing the data to the correct format? – azal Jun 22 '18 at 12:32
  • Like it says you need to know the time people were observed for and when they 'died' (ie got a mole) or the last time they were seen if they didn't get a mole. So for every patient, track the time from where you first saw them without a mole, to the time they got the mole or were last seen. That's the 'T' column in the example link. The 'E' column is whether they got a mole or not. You then need 1 row per patient. – James Jun 22 '18 at 12:43
  • But what happens if the patient had a mole immediately on the first visit? And in another question, why do you think Markov chains are not suitable for this problem? It's a transition problem and from what I read, they seem very suitable to tackle these kind of problems. – azal Jun 22 '18 at 14:17
  • If they have a mole at the next visit you're right in that you can't say when in that interval it developed, but it's traditional to say the event happened at the time of follow-up. The fact that the result is a ratio of hazards hopefully means that this shortcoming is at least balanced across the two arms. I'm afraid I don't have enough experience of Markov Chains to answer your second question, though. – James Jun 22 '18 at 14:21
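
The reshaping described in the comments above (one row per patient, a duration `T` and an event flag `E`) might look roughly like the sketch below. The patient IDs, dates and values are made up, and `to_survival_rows` is a hypothetical helper name. It assumes the clock starts at the first visit where the hand is mole-free, that the event is dated at the visit where the mole is first seen, and that patients never seen without a mole are left out.

```python
from datetime import date

# Hypothetical visit records: (patient_id, visit_date, right_hand_mole).
visits = [
    ("P1", date(2018, 1, 1), 0),
    ("P1", date(2018, 2, 1), 0),
    ("P1", date(2018, 3, 1), 1),
    ("P2", date(2018, 1, 10), 0),
    ("P2", date(2018, 4, 10), 0),
    ("P3", date(2018, 1, 5), 1),  # mole already present at every visit
    ("P3", date(2018, 2, 5), 1),
]

def to_survival_rows(visits):
    """Return one (patient_id, T_in_days, E) row per at-risk patient."""
    by_patient = {}
    for pid, day, mole in sorted(visits):
        by_patient.setdefault(pid, []).append((day, mole))

    rows = []
    for pid, records in by_patient.items():
        # Index of the first visit at which the hand had no mole.
        start = next((i for i, (_, mole) in enumerate(records) if mole == 0), None)
        if start is None:
            continue  # never seen without a mole, so never at risk (assumption)
        start_day = records[start][0]
        # First later visit with a mole, if any; otherwise the patient is censored.
        event_day = next((day for day, mole in records[start:] if mole), None)
        end_day = event_day if event_day is not None else records[-1][0]
        rows.append((pid, (end_day - start_day).days, int(event_day is not None)))
    return rows

rows = to_survival_rows(visits)
print(rows)
```

The `T` and `E` columns of `rows` are in the shape that survival-analysis tools generally expect; the same reshaping would be repeated for the left hand, and split by whether the other hand had a mole at baseline.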

There is no modeling to be done here, all of your questions are simple conditional probabilities.

Alright, since people did not appreciate that answer, you need to clarify a couple of things.

I am interested in finding the probability of a hand developing a mole among only the patients that developed a mole in one hand and finding the probability of developing a mole in the other hand (given that the patient had already a mole in the other hand).

Do you mean per visit? Or that they never developed a mole ever? From your example:

Patients 1 and 3 developed a mole on one hand. Patient 1 never developed a mole on the other hand but patient 3 did, so you could argue the answer to your question is 50%. Now, you could also argue that patient 1 had 4 checkups with 1 mole and not on the other and patient 3 had 0 checkups with 1 mole and not the other so the probability could be 1/5 = 20%. It depends on how you define your question.
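
To show how much the definition matters, the sketch below computes two possible readings of "probability of a mole in the other hand" on hypothetical records (not the question's data, and not meant to reproduce the percentages above). The per-patient reading counts patients; the per-visit reading counts checkups. Both function names are illustrative.

```python
# Hypothetical records: patient_id -> list of (left, right) flags per visit, in time order.
records = {
    "P1": [(0, 0), (1, 0), (1, 0), (1, 0)],
    "P2": [(0, 0), (0, 0), (0, 0)],
    "P3": [(0, 1), (1, 1), (1, 1)],
    "P4": [(1, 0), (0, 0), (0, 1)],
}

def per_patient_probability(records):
    """Among patients who ever had a mole in some hand, the fraction who
    at some point (not necessarily the same visit) had one in each hand."""
    ever_one = ever_both = 0
    for visits in records.values():
        left_ever = any(left for left, _ in visits)
        right_ever = any(right for _, right in visits)
        if left_ever or right_ever:
            ever_one += 1
            if left_ever and right_ever:
                ever_both += 1
    return ever_both / ever_one

def per_visit_probability(records):
    """Among visits where at least one hand has a mole, the fraction of
    visits where both hands have one."""
    any_mole = both = 0
    for visits in records.values():
        for left, right in visits:
            if left or right:
                any_mole += 1
                if left and right:
                    both += 1
    return both / any_mole

print("per patient:", per_patient_probability(records))
print("per visit:  ", per_visit_probability(records))
```

On this toy data the per-patient figure is 2/3 while the per-visit figure is 1/4, so the two readings can differ a lot; neither accounts for the repeated measurements within a patient.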

astel
  • Thank you for your reply. Can you help me even with that? I would really appreciate it. Some colleagues of mine, though, told me to use longitudinal modelling for the data or Bayesian statistics. These do not apply here I guess? – azal Jun 15 '18 at 15:21
  • This post does not answer the question, because any statement about a probability is intrinsically a model. The important issue is "what model is it (or should it be)?" – whuber Jun 15 '18 at 15:45

Personally, I think you can start by studying multivariate covariance generalized linear models: https://cran.r-project.org/web/packages/mcglm/index.html

https://cran.r-project.org/web/packages/mcglm/vignettes/GLMExamples.html

http://cursos.leg.ufpr.br/mcglm4aed/slides/2-mcglm.html#(1)

Those models are appropriate when you have more than one response variable and the responses are not Gaussian, which is your case, as you have two binary variables (mole or no mole in each hand). The method also lets you deal with intra-individual dependencies, which arise from the longitudinal structure. Here, longitudinal means repeated measures for the same individual over time.

I think the links above will help you to have a good idea about these techniques, and they also provide the computational implementation in R.

Bruna w