What you describe is a reliability study in which each subject is assessed by the same three raters on two occasions. The analysis can be done separately for the two outcomes (length and weight; I assume they will be highly correlated, and that you are not interested in how this correlation is reflected in the raters' assessments). Measurement reliability can be estimated in two ways:
- The original approach (as described in Fleiss, 1987) relies on the analysis of variance components through an ANOVA table, where we assume no subject-by-rater interaction (the corresponding sum of squares is constrained to 0); of course, you won't look at $p$-values, but at the mean squares for the relevant effects (see the first sketch after this list);
- A mixed-effects model allows you to derive variance estimates directly, with time as a fixed effect and subject and/or rater as random effects. Treating rater as random depends on whether you consider your three observers to be sampled from a larger pool of potential raters or not; if the rater effect is small, the two specifications will yield very similar estimates of outcome reliability (see the second sketch below).
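To make the ANOVA route concrete, here is a minimal sketch in R, assuming a long-format data frame `d` with columns `subject`, `rater`, `occasion`, and `length` (the names and the balanced design are assumptions on my part, not something from your question):

```r
## Two-way ANOVA with no subject-by-rater interaction, as in Fleiss (1987).
## Assumed layout: one row per subject x rater x occasion, with subject
## and rater coded as factors. One occasion is used here so that each
## subject x rater cell holds a single measurement.
d1 <- subset(d, occasion == 1)

m   <- lm(length ~ subject + rater, data = d1)
tab <- anova(m)
ms_subject <- tab["subject",   "Mean Sq"]
ms_rater   <- tab["rater",     "Mean Sq"]
ms_error   <- tab["Residuals", "Mean Sq"]

n <- nlevels(d1$subject)  # number of subjects
k <- nlevels(d1$rater)    # number of raters (3 here)

## ICC for a two-way random-effects design, single rating, absolute
## agreement (ICC(2,1) in Shrout & Fleiss's notation):
icc <- (ms_subject - ms_error) /
  (ms_subject + (k - 1) * ms_error + k * (ms_rater - ms_error) / n)
icc
```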
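And a sketch of the mixed-effects route with lme4 (my choice of package here; any mixed-model software would do), contrasting the fixed-rater and random-rater specifications mentioned above:

```r
library(lme4)

## Rater as a fixed effect: your three raters are the only ones of interest.
m_fixed  <- lmer(length ~ occasion + rater + (1 | subject), data = d)

## Rater as a random effect: the three raters are viewed as sampled from
## a larger pool of potential raters.
m_random <- lmer(length ~ occasion + (1 | subject) + (1 | rater), data = d)

## Variance components and a single-rating reliability (ICC) estimate:
vc     <- as.data.frame(VarCorr(m_random))
sigma2 <- setNames(vc$vcov, vc$grp)  # subject, rater, Residual
icc    <- sigma2[["subject"]] / sum(sigma2)
icc
```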
In both cases, you will be able to derive a single intraclass correlation coefficient as a measure of the reliability of the assessments (in Generalizability Theory, such coefficients are called generalizability coefficients), which answers your second question. The first question deals with a potential effect of time (considered as a fixed effect), which I discussed in Reliability in Elicitation Exercise. More details can be found in Dunn (1989) or Brennan (2001).
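For the record, in the random-rater model with no subject-by-rater interaction, this single-rating coefficient is just the ratio of the subject variance to the total variance,
$$\hat{\rho} = \frac{\hat{\sigma}^2_{\text{subject}}}{\hat{\sigma}^2_{\text{subject}} + \hat{\sigma}^2_{\text{rater}} + \hat{\sigma}^2_{\text{error}}},$$
which is what the last lines of the second sketch above compute.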
I have an R example script on GitHub that illustrates both approaches. I think it would not be too difficult to incorporate rater effects into the model.
References
- Fleiss, J.L. (1987). The design and analysis of clinical experiments. New York: Wiley.
- Dunn, G. (1989). Design and analysis of reliability studies. Oxford University Press.
- Brennan, R.L. (2001). Generalizability theory. New York: Springer.