Fork me on GitHub


Bootstrapping for Propensity Score Analysis


As the popularity of propensity score methods for estimating causal effects in observational studies increase, the choice of specific method has also increased. Rosenbaum (2012) has suggested that there are substantial benefits for testing a hypothesis more than once. Moreover, with the wide availability of high power computers resampling methods such as bootstrapping (Efron, 1979) have become popular for providing more stable estimates of the sampling distribution. This paper introduces the PSAboot package for R that provides functions for bootstrapping propensity score methods. It deviates from traditional bootstrapping methods by allowing for different sampling specifications for treatment and control groups. Additionally, this framework will provide estimates using multiple methods for each bootstrap sample. Two examples are discussed: the classic National Work Demonstration and PSID (Lalonde, 1986) study and a study on tutoring effects on student grades.


You can download PSAboot from CRAN:

install.packages("PSAboot", repos = "")

Or the latest development version of the PSAboot package can be downloaded from Github using the devtools package.

devtools::install_github("PSAboot", "jbryer")

Then load the package.



The package includes three demos:


The PSAboot function will perform the actual bootstrapping. It has a number of parameters for to specify how the bootstrap samples are drawn.

Other parameters can be passed to methods using the ... parameter.


The methods parameter on the PSAboot function specifies the different propensity score methods that will be used. Specifically, for each bootstrap sample drawn, each function will be called. This allows for a comparison of methods across all bootstrap samples. Five methods are included, they are:

Defining Custom Methods

It is possible to define a custom method. Each method is a function with, at minimum, the following six parameters:

Each method must return a list with three elements:

For example, the boot.matching.1to3 function below wraps the built-in boot.matching method but sets the M parameter to 3, thereby performing 1-to-3 matching instead of the default 1-to-1 matching. This framework simplifies the process of using, and comparing, slight variations of different propensity score methods.

boot.matching.1to3 <- function(Tr, Y, X, X.trans, formu, ...) {
    return(boot.matching(Tr = Tr, Y = Y, X = X, X.trans = X.trans, formu = formu, M = 3, ...))

The PSAboot function returns an object of class PSAboot. The following S3 methods are implemented: print, summary, plot, boxplot, and matrixplot.

Example: National Work Demonstration and PSID

The lalonde (Lalonde, 1986) has become the de defacto teaching dataset in PSA since Dehejia and Wahba’s (1999) re-examination of the National Supported Work Demonstration (NSW) and the Current Population Survey (CPS).

The lalonde data set is included in the MatchIt package. The crosstab shows that there are 429 control units and 185 treatment units.

data(lalonde, package = "MatchIt")

  0   1 
429 185 

lalonde.formu <- treat ~ age + I(age^2) + educ + I(educ^2) + black + hispan + married + nodegree + 
    re74 + I(re74^2) + re75 + I(re75^2) + re74 + re75
boot.lalonde <- PSAboot(Tr = lalonde$treat, Y = lalonde$re78, X = lalonde, formu = lalonde.formu, 
    M = 100, seed = 2112)

100 bootstrap samples using 5 methods.
Bootstrap sample sizes:
   Treated=185 (100%) with replacement.
   Control=429 (100%) with replacement.

The summary function provides numeric results for each method including the overall estimate and confidence interval using the complete sample as well as the pooled estimates and confidence intervals with percentages of the number of confidence intervals that do not span zero.


Stratification Results:
   Complete estimate = 550
   Complete CI = [-1168, 2269]
   Bootstrap pooled estimate = 636
   Bootstrap pooled CI = [-1123, 2395]
   16% of bootstrap samples have confidence intervals that do not span zero.
      15% positive.
      1% negative.
ctree Results:
   Complete estimate = 169
   Complete CI = [-1552, 1891]
   Bootstrap pooled estimate = 679
   Bootstrap pooled CI = [-1528, 2886]
   22% of bootstrap samples have confidence intervals that do not span zero.
      18% positive.
      4% negative.
rpart Results:
   Complete estimate = -875
   Complete CI = [-2567, 816]
   Bootstrap pooled estimate = -825
   Bootstrap pooled CI = [-3997, 2348]
   21% of bootstrap samples have confidence intervals that do not span zero.
      4% positive.
      17% negative.
Matching Results:
   Complete estimate = -606
   Complete CI = [-1296, 83.2]
   Bootstrap pooled estimate = -577
   Bootstrap pooled CI = [-2814, 1659]
   75% of bootstrap samples have confidence intervals that do not span zero.
      16% positive.
      59% negative.
MatchIt Results:
   Complete estimate = 617
   Complete CI = [-801, 2034]
   Bootstrap pooled estimate = 365
   Bootstrap pooled CI = [-1287, 2018]
   10% of bootstrap samples have confidence intervals that do not span zero.
      8% positive.
      2% negative.

The plot function plots the estimate (mean difference) for each bootstrap sample. The default is to sort from largest to smallest estimate for each method separately. That is, rows do not correspond across methods. The sort parameter can be set to none for no sorting or the name of any method to sort only based upon the results of that method. In these cases the rows then correspond to matching bootstrap samples. The blue points correspond to the the estimate for each bootstrap sample and the horizontal line to the confidence interval. Confidence intervals that do not span zero are colored red. The vertical blue line and green lines correspond to the overall pooled estimate and confidence for each method, respectively.


plot of chunk lalonde.plot

The hist function plots a histogram of the estimates across all bootstrap samples for each method.


plot of chunk lalonde.histogram

The boxplot function depicts the distribution of estimates for each method along with confidence intervals in green. Additionally, the overall pooled estimate and confidence interval across all bootstrap samples and methods are represented by the vertical blue and green lines, respectively.


plot of chunk lalonde.boxplot

The matrixplot summarizes the estimates across methods for each bootstrap sample. The lower half of the matrix are scatter plots where each point represents the one bootstrap sample. The red line is a Loess regression line. The main diagonal depicts the distribution of effects and the upper half provides the correlation of estimates. In the following example we see that the ctree and Stratification methods have the strongest agreement with a correlation of 0.63 whereas the rpart and MatchIt methods have the least agreement with a correlation of 0.22.


plot of chunk lalonde.matrixplot

Evaluating Balance

The strength of propensity score analysis relies on achieving good balance. Typically one would evaluate each covariate separately to ensure that sufficient balance has been achieved. We recommend Helmreich and Pruzek (2009) for a more complete discussion of visualizations to evaluate balance. Given the large number of samples and methods used, it is desirable to have a single metric to evaluate balance. Drawing on the principles of the multiple covariate balance assessment plot (Helmreich & Pruzek, 2009), the balance function will estimate the effect of each covariate before and after adjustment. Moreover, to provide a single metric for each sample and method, the mean standardized effect size will be used.

lalonde.bal <- balance(boot.lalonde)

Unadjusted balance: 0.443234083160092
               Complete Bootstrap
Stratification   0.1487    0.1447
ctree            0.2013    0.1608
rpart            0.2938    0.2886
Matching         0.1296    0.1917
MatchIt          0.2787    0.2744

The plot function provides density plots of the balance statistics across all bootstrap samples for each method. The mean overall balance is represented by the vertical black lines, the overall balance for the complete dataset is represented by the vertical blue line, and the adjusted balance is represented by the vertical red line. Although no specific guidelines are recommended in the literature, achieving a balance of less than 0.1 is desirable. In this example we see that PSA has reduce the bias in all methods although rpart, and to a lesser extent MatchIt, did not reduce the bias by as much as would be typically desired.


plot of chunk lalonde.balance.plot

The boxplot function provides a more nuanced depiction of the balance by separating on the distribution of balance statistics by individual covariate. In addition to the boxplot of balance statistics, the mean balance statistic is represented by the blue point and the unadjusted balance statistic by the red point. We see that the largest imbalance was in the black, married, and re74 (real earnings in 1974) before adjustment. All of the estimates reduced the bias in these covariates although it is worth noting that the MatchIt method did not reduce the bias in the black covariate to a desirable level.

This figure, in conjunction with the density plot above, show that the relatively large mean balance for rpart is likely due to some outlier samples where the adjusted balance for educ, re75, and hispan are fairly large. It should be noted that with these exceptions, balance for each covariates are generally less than the unadjusted balance.


plot of chunk lalonde.balance.boxplot

= Github page; = RSS XML Feed; = External website; = Portable Document File (PDF)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Creative Commons License
Formulas powered by MathJax