Test scores and kindergarten: a multivariate analysis

I examine the relationship between test scores and kindergarten experience, taking background variables into account. The conclusions might have implications for parents.




  • The dependent variables are math, literacy, and social. These are test results.
  • The explanatory variables are divided into
    • variables of interest: kindergarten_type, kindergarten_amt, and
    • control variables: gender, parents_educ, birth_year, birth_month, ID.


  • Students who attend kindergarten tend to have higher test scores.
  • Students who spend more time in kindergarten per week tend to perform better on the test.


Multivariate multiple linear regression (MultiMLR) is used. This method allows for several Y variables and several explanatory X variables, as opposed to multiple linear regression (MLR), in which there is a single Y variable. I used MultiMLR instead of the more common MLR approach because the Y variables are correlated. (Further details on the difference between MLR and MultiMLR can be found here and here.)
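To make the difference between MLR and MultiMLR concrete, here is a minimal sketch on simulated data. The data frame sim and the variables x, y1, y2 are invented for illustration and are not part of the analysis. It shows that the coefficient estimates of a multivariate fit equal those of separate univariate fits. What the multivariate model adds is the residual correlation between the responses, which is used when testing an explanatory variable jointly across all responses.

```r
# Simulated data; all names and numbers are made up for illustration.
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)      # a dummy explanatory variable
shared <- rnorm(n)          # shared noise that makes the responses correlated
y1 <- 20 + 5 * x + shared + rnorm(n)
y2 <- 30 + 2 * x + shared + rnorm(n)
sim <- data.frame(x, y1, y2)

# Multivariate fit: one call, several responses
fit_multi <- lm(cbind(y1, y2) ~ x, data = sim)

# Separate univariate fits, one per response
fit_y1 <- lm(y1 ~ x, data = sim)
fit_y2 <- lm(y2 ~ x, data = sim)

# The coefficient estimates agree column by column
coef(fit_multi)
cbind(y1 = coef(fit_y1), y2 = coef(fit_y2))

# The residuals of the two responses are correlated
cor(resid(fit_multi))

# Joint test of x across both responses (Pillai's trace by default)
anova(fit_multi)
```

The last call is where the multivariate model differs from running the two regressions separately: the test takes the correlation between y1 and y2 into account.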


The R script, including output graphs and tables, can be found in this RPubs document, which is written in R Markdown. Comments and some variable names are in Swedish.


Make a screencast going over the RPubs document in English, since some variables are coded in Swedish.

 Analysis outline

Until I have done a proper screencast to explain the analysis, I provide some pointers to the key outputs below.

  1. A major part of the code recodes the variables, because the dataset we were given should not be used as is. Firstly, I recode the variables birth_year and birth_month into the quarter of the year in which the student was born. Secondly, I reshape the two categorical variables kindergarten_type and kindergarten_amt into dummy variables measuring (a) whether the student went to kindergarten or not, and (b) whether they spent extra time in kindergarten during a typical week.
  2. A scatter plot matrix displays the relationships among the Y variables. Combined with cor(), it shows that the dependent variables are correlated.
    pairs(~ math + literacy + social, df) produces the following graph: [plot1]

    Then I create three boxplots to visualize the distributions, which look reasonable, with no outliers to handle. par(mfrow=c(1,3)) aligns the boxplots in three columns, and boxplot(df$variable_name) produces each plot: [plot2]

  3. Below is a detailed argument for why MultiMLR is a suitable method. I use a decision tree of questions to arrive at the model choice.
    • Q: What type of relationship is being examined, dependence or interdependence?
      A: Dependence.
    • Q: Number of dependent variables?
      A: Several.
    • Q: Measurement scale of the dependent variables?
      A: Numerical; the Y variables range from 0 to 50.
    • Q: Measurement scale of the explanatory variables?
      A: Numerical and categorical.
    • Based on the questions above, we end up with multivariate multiple linear regression as our model choice. The method is also suitable given the research question.
  4. The fitted model is
    lm(cbind(math, literacy, social) ~
         # variables of interest:
         kinderg + spec_kinderg + extratime
         # control variables:
         + parents_educ + boy + kva1 + kva2 + kva3,
       data = df)

    (The variable names above are translated from the Swedish names used in the RPubs document.)

    • The regression measures how much the three variables of interest affect the three Y variables, while controlling for parents' education, gender, and the quarter of the year in which the student was born.
    • The three variables of interest
      (i) kinderg,
      (ii) spec_kinderg,
      (iii) extratime
      are all dummy variables. They measure whether the student
      (i) went to kindergarten at all,
      (ii) went to a special kindergarten of the type “Montessori” or “Ur och Skur”,
      (iii) spent over 15 hours per week in kindergarten.
  5. The estimated coefficients of the model are shown below. I pasted them from the document, translated the Swedish variable names, and rounded the estimates.
                  math       literacy  social
    (Intercept)   23.73265   16.07515  32.51889
    kinderg        7.40438    1.92834   9.49967
    spec_kinderg  not sig.   -2.41103  -3.19622
    extratime      1.52378    6.57021   5.55214

    From this table of coefficients we can draw some conclusions:

    • Students who attend kindergarten tend to have higher test scores. How much higher is seen from the coefficients above: 7.4 points for math, 1.9 for literacy, and 9.5 for social. Notably, literacy shows the smallest increase.
    • Students who spend over 15 hours per week in kindergarten tend to perform better on the test. Notably, spending more hours affects the literacy and social scores much more than the math results.
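To read the coefficient table as predicted scores, the estimates can simply be added up. The sketch below copies the intercept, kinderg, and extratime rows from the table and compares three hypothetical students. It assumes all control variables are held at their reference levels (the intercept row), since the control coefficients are not shown in the table, and it leaves out spec_kinderg, so the students are assumed to attend a regular kindergarten. The object names are chosen for illustration.

```r
# Coefficients copied from the table above (intercept, kinderg, extratime rows)
coefs <- matrix(
  c(23.73265, 16.07515, 32.51889,
     7.40438,  1.92834,  9.49967,
     1.52378,  6.57021,  5.55214),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("(Intercept)", "kinderg", "extratime"),
                  c("math", "literacy", "social"))
)

# Predicted scores with the control variables at their reference levels (assumption)
no_kinderg    <- coefs["(Intercept)", ]                 # did not attend kindergarten
kinderg_only  <- no_kinderg + coefs["kinderg", ]        # attended, at most 15 hours per week
kinderg_extra <- kinderg_only + coefs["extratime", ]    # attended, over 15 hours per week

# One row per hypothetical student, one column per test
round(rbind(no_kinderg, kinderg_only, kinderg_extra), 1)
```

Each row differs from the one above it by exactly one coefficient row, which is what the dummy coding means: a dummy's coefficient is the estimated shift in the score when it switches from 0 to 1, other variables held fixed.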

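The recoding described in step 1 of the outline can be sketched as follows. This is not the code from the RPubs document: the toy data frame raw and the category labels in kindergarten_type and kindergarten_amt are invented for illustration, and using quarter 4 as the reference category for kva1 to kva3 is an assumption.

```r
# Toy data; the category labels are invented for illustration.
raw <- data.frame(
  birth_month       = c(1, 4, 8, 12, 6),
  kindergarten_type = c("none", "regular", "Montessori", "Ur och Skur", "regular"),
  kindergarten_amt  = c("none", "under 15 h", "over 15 h", "over 15 h", "under 15 h")
)

# Quarter of birth: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
raw$quarter <- (raw$birth_month - 1) %/% 3 + 1

# Quarter dummies, with quarter 4 as the reference category (assumption)
raw$kva1 <- as.integer(raw$quarter == 1)
raw$kva2 <- as.integer(raw$quarter == 2)
raw$kva3 <- as.integer(raw$quarter == 3)

# Dummy variables of interest
raw$kinderg      <- as.integer(raw$kindergarten_type != "none")
raw$spec_kinderg <- as.integer(raw$kindergarten_type %in% c("Montessori", "Ur och Skur"))
raw$extratime    <- as.integer(raw$kindergarten_amt == "over 15 h")

raw
```

Leaving out one quarter dummy avoids perfect collinearity with the intercept; the omitted quarter becomes the baseline that the other quarters are compared against.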
