Fork me on GitHub

Comparing two data frames with different number of rows


January 24, 2013
Categorized as: R, R-Bloggers, comparison


I posted a question over on StackOverflow on an efficient way of comparing two data frames with the same column structure, but with different rows. What I would like to end up with is an n x m logical matrix where n and m are the number of rows in the first and second data frames, respectively; and the value at the ith row and jth column indicates whether all the values from row i from data frame one is equal to row j from data frame two. To provide some context, this will be used in a propensity score matching algorithm to identify candidate matches that match exactly on any number of covariates. In addition to the approaches I had, joran provided an approach using the Vectorize function (thanks again as I learned another nice function). I decided to put three approaches to a race

To understand what I need, I’ll start with a small example with two data frames, one with 4 rows, the other with 3, and each has two variables, one logical and the other numeric. As an aside, I only need this to work for integers, factors, characters, and logical types therefore avoiding issues of comparing numerics.

> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
> df1
   var1 var2
1  TRUE    1
2  TRUE    2
3 FALSE    3
4 FALSE    4
> df2
   var1 var2
5 FALSE    5
6  TRUE    2
7 FALSE    3

First, let’s consider the case when there is only one variable:

> system.time({

This is pretty straight forward. Now I want the same type of result, but to compare more than one column (in the final implementation I need to handle any number of columns so not necessarily limited to one or two).

The first approach uses nested apply functions.

> system.time({

Secondly, using the Vectorize and outer functions.

> system.time({

Lastly, we’ll create a new character vector by pasting the other variables together.

> system.time({

We can already see with this small example that the Vectorize approach is the slowest. However, let’s try a larger example. First we’ll create two data frames, one with 1,000 rows and the second with 1,500. The resulting matrix will be 1,000 x 1,500.

set.seed(2112)
df1 <- data.frame(row.names=1:1000, 
				  var1=sample(c(TRUE,FALSE), 1000, replace=TRUE), 
				  var2=sample(1:10, 1000, replace=TRUE) )
df2 <- data.frame(row.names=1001:2500, 
				  var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
				  var2=sample(1:10, 1500, replace=TRUE))

Nested apply functions approach:

> system.time({

Vectorize approach:

> system.time({

Combined columns approach:

> system.time({

The combined column approach is by far the fasted way, and it makes good since. It is a bit surprising (at least to me), how much worse the Vectorize and outer functions are. Moreover, I am a bit concerned about potential issues with the paste method and doing comparisons on those results. Please feel free to leave comments below if there are other approaches.

comments powered by Disqus

= Github page; = RSS XML Feed; = External website; = Portable Document File (PDF)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Creative Commons License
Formulas powered by MathJax