Most Popular

1500 questions
51 votes, 5 answers

Prediction in Cox regression

I am doing a multivariate Cox regression; I have my significant independent variables and beta values. The model fits my data very well. Now I would like to use my model to predict the survival of a new observation. I am unclear how to do this…
Marja
51 votes, 8 answers

What is a good resource on table design?

I've seen various theoretical treatments of graphics, such as the Grammar of Graphics. But I have seen nothing equivalent with regard to tables. Over time I have developed an informal model of good practice in table design. However, I'd like…
Jeromy Anglim
51 votes, 7 answers

When conducting a t-test why would one prefer to assume (or test for) equal variances rather than always use a Welch approximation of the df?

It seems that when the assumption of homogeneity of variance is met, the results from a Welch-adjusted t-test and a standard t-test are approximately the same. Why not simply always use the Welch-adjusted t-test?
russellpierce
51 votes, 4 answers

Cumming (2008) claims that the distribution of p-values obtained in replications depends only on the original p-value. How can it be true?

I have been reading Geoff Cumming's 2008 paper Replication and $p$ Intervals: $p$ values predict the future only vaguely, but confidence intervals do much better [~200 citations in Google Scholar] -- and am confused by one of its central claims.…
amoeba
51 votes, 6 answers

Understanding LSTM units vs. cells

I have been studying LSTMs for a while. I understand at a high level how everything works. However, when implementing them using TensorFlow, I noticed that BasicLSTMCell requires a number-of-units (i.e. num_units) parameter. From this very…
user124589
51 votes, 2 answers

Choosing the right linkage method for hierarchical clustering

I am performing hierarchical clustering on data I've gathered and processed from the Reddit data dump on Google BigQuery. My process is the following: get the latest 1000 posts in /r/politics; gather all the comments; process the data and compute an…
51 votes, 3 answers

Different ways to write interaction terms in lm?

I have a question about which is the best way to specify an interaction in a regression model. Consider the following data: d <- structure(list(r = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),…
Manuel Ramón
51 votes, 3 answers

How does centering make a difference in PCA (for SVD and eigen decomposition)?

What difference does centering (or de-meaning) your data make for PCA? I've heard that it makes the maths easier or that it prevents the first PC from being dominated by the variables' means, but I feel like I haven't been able to firmly grasp the…
Zenit
51 votes, 1 answer

How to determine whether or not the y-axis of a graph should start at zero?

One common way to "lie with data" is to use a y-axis scale that makes it seem as if changes are more significant than they really are. When I review scientific publications, or students' lab reports, I am often frustrated by this "data visualization…
ff524
51 votes, 4 answers

If the t-test and the ANOVA for two groups are equivalent, why aren't their assumptions equivalent?

I'm sure I've got this completely wrapped round my head, but I just can't figure it out. The t-test compares two normal distributions using the Z distribution. That's why there's an assumption of normality in the DATA. ANOVA is equivalent to linear…
Chris Beeley
51 votes, 3 answers

How are we defining 'reproducible research'?

This has come up in a few questions now, and I've been wondering about something. Has the field as a whole moved toward "reproducibility" focusing on the availability of the original data, and the code in question? I was always taught that the core…
Fomite
51 votes, 2 answers

Why does frequentist hypothesis testing become biased towards rejecting the null hypothesis with sufficiently large samples?

I was just reading this article on the Bayes factor for a completely unrelated problem when I stumbled upon this passage: Hypothesis testing with Bayes factors is more robust than frequentist hypothesis testing, since the Bayesian form avoids model…
Louis Thibault
50 votes, 5 answers

Probability distribution for different probabilities

If I wanted to get the probability of 9 successes in 16 trials, with each trial having a success probability of 0.6, I could use a binomial distribution. What could I use if each of the 16 trials has a different probability of success?
Greg
50 votes, 7 answers

Logistic Regression in R (Odds Ratio)

I'm trying to undertake a logistic regression analysis in R. I have attended courses covering this material using Stata. I am finding it very difficult to replicate that functionality in R. Is R mature in this area? There seems to be little…
SabreWolfy
50 votes, 7 answers

Why is "statistically significant" not enough?

I have completed my data analysis and got "statistically significant results", which are consistent with my hypothesis. However, a student in statistics told me this is a premature conclusion. Why? Is there anything else that needs to be included in my…