Friday, January 21, 2011

Posted Question for R Users

I recently undertook a project where a colleague had about 12 .csv files that they wanted to merge. Each file had a common (key) variable 'Partner' (which is trading partner) with differing columns (variables) except for the common key variable. Actually, the first column (Partner) was the same in each data set, with each data set having 211 rows. The remaining columns were unique to each data set. I required a combined data set.

In SAS (PROC SQL) this would be a matter of a series of left joins. Working in the R environment, I executed what was equivalent to left joins via the 'merge' function. However this was tedious, only being able to join 2 tables at a time. In SAS I could use the merge function, which would allow me to merge all 12 tables in 1 data step.

Does anyone know of a better way to do this in R, as opposed to my merge statements? (see sample code below)

R Merge Statements:

# -------------------------------------------------------------------
#|  merge data sets with R merge function              
# ------------------------------------------------------------------
 
#  left join b_dat onto a_dat on variable Partner
 
 ab <- merge(a_dat,b_dat, by.a_dat = Partner, by.b_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 
# left join c_dat onto ab on variable Partner
 
 a_c <- merge(ab,c_dat, by.ab = Partner, by.c_dat =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 
# I have about 10 more data sets to left join with a_c, is there a better way to join these 
# in R as opposed to pairwise merges like above?

Created by Pretty R at inside-R.org


I found some help on r-wiki, it merges all 3 data sets at once, but gives me extra redundant columns with .x and .y appended to their names. I'm not sure about these results.

my.list <- list(a_dat, b_dat, c_dat)
 
DF <- a_dat
for ( .df in my.list ) {
  DF <-merge(DF,.df,by.x="Partner", by.y="Partner", all = FALSE, all.x = TRUE, all.y = FALSE)
 }
 
dim(DF)

Created by Pretty R at inside-R.org



SAS - similar code for 3 data sets DAT_A, DAT_B, DAT_C

DATA ALL_COUNTRIES;
MERGE DAT_A (IN=A) DAT_B (IN=B)DAT_C (IN=C);
BY PARTNER;
IF B AND C; /* Actually its been so long since */
RUN; /* I have used a data step vs PROC SQL I'm not sure this*/
/* statement gives me a left join, but I'm not sure I can */
/* do any better than 2 at a time left joins in PROC SQL */
/* so that would not be any better than my R code above */

2 comments:

  1. Do you know about Stack Overflow? Here's a similar question with the answer I would have suggested:

    http://stackoverflow.com/questions/2209258/merge-several-data-frames-into-one-data-frame-with-a-loop

    ReplyDelete
  2. Thanks. I have implemented those changes and it worked elegantly!

    ReplyDelete