
I have two data sets (two different countries), and for each data set I have three variables: two independent (year and car model) and one dependent (sales).

To make it more visual it would look something like this:

Country 1

Car_Model  Year    Sales
A          1       100
A          2       200
B          1       80
B          2       90
C          1       66  
C          2       20

Country 2

Car_Model  Year       Sales
A          1          120
A          2          220
B          1          82
B          2          92
C          1          62
C          2          22

I have been asked to perform a t-test to check whether there is a significant difference between countries, taking the variables car_model and year into consideration. However, I do not understand whether a t-test is possible for such a problem, which kind of t-test I should perform, or how I should use the variables.

I think the idea here is to check the significant difference of model A in country 1 against model A in country 2, then model B in country 1 against model B in country 2, and so on, and at the end obtain an overall p-value based on those individual p-values. Is it possible to do this using a t-test in R/SPSS?

Stephan
jaws234
  • 5
    You can compare sales in different countries but the key substantive question is whether it makes sense to do that independently of model and year. I guess not. This is rather too close to "how do I analyse my data?" to which a short answer is to think in terms of a regression. (t test not T test is standard notation.) – Nick Cox Aug 21 '18 at 10:38
  • I do not want just to compare sales in different countries, I know that is basic level. What I want to do here is to perform a t-test to check the significant differences between countries but taking into consideration model and year, checking the significant difference of model A in country 1 against model A in country 2, then model B in country 1 against model B in country 2, and so on. And at the end get a p value based on those previous p values. My teacher just told me to do a t-test in order to do this, but I do not think that it is possible by only using a t-test. That is the question – jaws234 Aug 21 '18 at 12:15
  • 1
    You can run a series of t tests but that is not a good idea. With several models and years you won't get a good idea of the structure of the data that way. Nor can you combine t test P-values, because the tests are not independent! Hence the advice to explore a regression model. (FWIW, your question referred repeatedly to a t test and you emphasised between countries. As it doesn't indicate what you really want, you should rewrite it.) – Nick Cox Aug 21 '18 at 12:22
  • 3
    I think some of the comments may have been unintentionally harsh. I read this as a regression question. In that setting a t-test arises naturally as a test of the country coefficient. If that means nothing to you then you have a lot of research to do (and your teacher has been unfair in asking you to do something you are unprepared for), but if it sounds familiar, then you should have no trouble proceeding. – whuber Aug 21 '18 at 13:43
  • 1
    This is not possible with ONE `t.test`, as `t.test` can statistically compare (the averages of) 2 groups (i.e. vectors) without taking into account any other info. You can perform multiple `t.test`s, one for each combination of `Car` and `Year`, but that would be tough if you have lots of combinations. What people typically look for when they say something like what you've described is the output of something like `lm(Sales ~ Country + Car + Year)`, where you can interpret the p-value of the `Country` coefficient, when controlling for `Year` and `Car`. – AntoniosK Aug 21 '18 at 12:34
  • Performing multiple t.tests, one for each combination of Car and Year, and combining them all at the end to get a final p value is something that I also considered. However, I am not sure how statistically correct it is. Any insights on this? – jaws234 Aug 21 '18 at 14:28
  • 3
    Best thing to do is specifically ask the person who assigned you this to tell you how he's imagining the output he expects from you. It's not wrong to have these discussions with them until you have a clear plan of action :) – AntoniosK Aug 21 '18 at 14:31
  • [How exactly does one “control for other variables”?](https://stats.stackexchange.com/q/17336/17230) may be helpful. – Scortchi - Reinstate Monica Aug 22 '18 at 09:44
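
A minimal sketch of the regression suggested in the comments above, using the example figures from the question. The object names (`country1`, `country2`, `sales`, `fit`) are made up for illustration, the column is called `Car_Model` as in the question's tables rather than `Car`, and treating `Year` as a factor is an assumption, not something the question specifies.

```r
# Example data from the question, one data frame per country
country1 <- data.frame(
  Car_Model = rep(c("A", "B", "C"), each = 2),
  Year      = rep(1:2, times = 3),
  Sales     = c(100, 200, 80, 90, 66, 20)
)
country2 <- data.frame(
  Car_Model = rep(c("A", "B", "C"), each = 2),
  Year      = rep(1:2, times = 3),
  Sales     = c(120, 220, 82, 92, 62, 22)
)

# Stack both countries into one long data frame with a Country column
sales <- rbind(
  transform(country1, Country = "Country1"),
  transform(country2, Country = "Country2")
)

# Additive linear model: Country effect, controlling for model and year
# (Year as a factor is an assumption made for this sketch)
fit <- lm(Sales ~ Country + Car_Model + factor(Year), data = sales)

# The row for Country holds the estimate, standard error, t value and p-value
coef(summary(fit))["CountryCountry2", ]
```

The t value in that row is the t-test on the country coefficient that whuber's comment refers to. With only twelve observations this is an illustration of the mechanics, not a meaningful analysis.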

1 Answer

1

Performing a t-test in R can be done using the t.test() function. The real question here is which question you want the test to answer. Reading your description, I think you want to test whether mean car sales differ between countries, matching (or pairing up) observations by model and year. The null hypothesis would be: there is no difference between countries.

It seems that you have the same car models and years in both datasets, so they can be compared. If this is true and the rows are in the same order, you can do a paired t-test ( paired=TRUE ). If not, I would be hesitant to use a t-test, since you would be comparing different cars.

The alternative hypothesis, or 'effect' you would like to detect, is that either country has more sales than the other. Hence alternative='two.sided'.

country1 <- c(100,200,80,90,66,20)
country2 <- c(120,220,82,92,62,22)

t.test(x=country1,
       y=country2,
       paired=TRUE,
       alternative='two.sided')
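
Because the paired test relies on both vectors being in the same order, one way to guard against misaligned rows is to join the two tables on model and year first. The following is only a sketch: the data frames `c1` and `c2`, the merged table `both`, and the suffixes are hypothetical names, and the second table is deliberately reversed to show that row order no longer matters.

```r
# Example data from the question, kept with their identifying columns
c1 <- data.frame(
  Car_Model = rep(c("A", "B", "C"), each = 2),
  Year      = rep(1:2, times = 3),
  Sales     = c(100, 200, 80, 90, 66, 20)
)
c2 <- data.frame(
  Car_Model = rep(c("A", "B", "C"), each = 2),
  Year      = rep(1:2, times = 3),
  Sales     = c(120, 220, 82, 92, 62, 22)
)
c2 <- c2[nrow(c2):1, ]  # reverse the rows to simulate a different ordering

# Inner join: keeps only model/year combinations present in both countries
both <- merge(c1, c2, by = c("Car_Model", "Year"),
              suffixes = c("_c1", "_c2"))

# Paired t-test on the aligned columns
res <- t.test(both$Sales_c1, both$Sales_c2,
              paired = TRUE, alternative = "two.sided")
res
```

With `merge()`'s default settings, combinations that exist in only one country are dropped rather than filled in with zeros.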

If you would like to use year and model as covariates, I would suggest following AntoniosK's comment and using a linear model.

  • 1
    Why `paired = T`? Those are 2 different countries and not the same country (before vs. after). We don't even know if the datasets have the same number of rows, which is something that would break the paired test straight away :) – AntoniosK Aug 21 '18 at 12:48
  • 1
    You are right that paired should only be used if the same observations are present in both datasets. - I edited the comment to clear this up - It is fine though if the countries are different, since we want to compare those. But the car model and year should be the same in both datasets and in the same order. Since this is true in the brief description, I made the assumption that they were. But you are right that we can't be sure –  Aug 21 '18 at 12:51
  • You are right, I missed some information in the question. Yes, I do have the same car models in both countries. I do not have the same observations, but I understand that it would not be a problem to include the missing observations as sales = 0. And yes, the output I am looking for is to check if there is a significant difference between countries looking at the sales per model and month. Could you confirm whether it would be correct to add non-existent observations as sales = 0? – jaws234 Aug 21 '18 at 14:23
  • Another consideration is that sales in country 1 are way bigger than in country 2, which is not relevant for me as I am only looking for the curve the data points draw. Would this affect the t-test? – jaws234 Aug 21 '18 at 14:25
  • 1
    @jaws234 that's a tricky one. If you add sales = 0 the model will understand that this car model was available but it wasn't sold. However, you don't know if missing info means that that car model wasn't available. That's basically the difference between `0` and `NA`. – AntoniosK Aug 21 '18 at 14:35
  • Correct, you should code missing values as `NA`. Then using `t.test(na.action="na.omit")` you will exclude the `NA` value and its mate –  Aug 21 '18 at 14:47
  • 1
    (+1) This approach doesn't offer much flexibility, or provide any insight into the relation of Model or Year to Sales, but it does answer the question - if you were to assume equal variances, it's equivalent to regressing Sales on Country + Car_Model + Year + Car_Model x Year & performing a t-test on the regression coefficient for Country. – Scortchi - Reinstate Monica Aug 22 '18 at 10:09
  • 1
    @jaws234: "Another consideration is that sales in country 1 are way bigger than in country 2, which is not relevant for me as I am only looking for the curve the data points draw. Would this affect the t-test?" - yes it's precisely what the t-test is trying to discern. Perhaps you need to ask a new question clearly explaining the context - how the data were collected, what they represent, & what you're trying to find out from them. – Scortchi - Reinstate Monica Aug 22 '18 at 10:57
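
The equivalence mentioned in the comments above can be checked numerically on the example data. This is a sketch under the assumption that `Year` is treated as a factor; the names `sales`, `paired`, `fit` and `t_lm` are illustrative only.

```r
# Example data from the question in long format
sales <- data.frame(
  Country   = rep(c("Country1", "Country2"), each = 6),
  Car_Model = rep(rep(c("A", "B", "C"), each = 2), times = 2),
  Year      = factor(rep(1:2, times = 6)),
  Sales     = c(100, 200, 80, 90, 66, 20,
                120, 220, 82, 92, 62, 22)
)

# Paired t-test, pairing rows by position (same model/year order per country)
paired <- t.test(sales$Sales[sales$Country == "Country1"],
                 sales$Sales[sales$Country == "Country2"],
                 paired = TRUE)

# Regression with one parameter per model/year cell plus a Country effect
fit  <- lm(Sales ~ Country + Car_Model * Year, data = sales)
t_lm <- coef(summary(fit))["CountryCountry2", "t value"]

# The two t statistics agree up to sign; both tests have 5 degrees of freedom
c(paired = unname(paired$statistic), regression = unname(t_lm))
```

The sign differs only because the paired test computes country 1 minus country 2, while the regression coefficient measures country 2 relative to country 1.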