<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
  <title>Jason.Bryer.org Blog</title>
  <link href="http://jason.bryer.org/"/>
  <link type="application/atom+xml" rel="self" href="http://jason.bryer.org/feed.r-bloggers.xml"/>
  <updated>2013-05-09T15:35:59-07:00</updated>
  <id>http://jason.bryer.org/</id>
  <author>
    <name>Jason Bryer</name>
    <email>jason@bryer.org</email>
  </author>
 
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-05-09/Version_0.9_of_timeiline_on_CRAN</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-05-09/Version_0.9_of_timeiline_on_CRAN.html"/>
    <title>Version 0.9 of timeline on CRAN</title>
    <published>2013-05-09T00:00:00-07:00</published>
    <updated>2013-05-09T00:00:00-07:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;The initial version of the &lt;code&gt;timeline&lt;/code&gt; package has been released to CRAN. This package provides creates timeline plots using &lt;code&gt;ggplot2&lt;/code&gt; in a style similar to &lt;a href='http://www.preceden.com/'&gt;Preceden&lt;/a&gt;. I would considered this beta quality as there are more features I would like to add but has enough functionality to possibly be useful to others.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;install.packages(&amp;#39;timeline&amp;#39;,repos=&amp;#39;http://cran.r-project.org&amp;#39;)
require(timeline)
data(ww2)
timeline(ww2, ww2.events, event.spots=2, event.label=&amp;#39;&amp;#39;, event.above=FALSE)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='http://jason.bryer.org/images/timeline/ww2.png' alt='Timeline of World War II' /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ww2&lt;/code&gt; demo (type &lt;code&gt;demo(ww2)&lt;/code&gt; at the R console to start) provides many variations of the timeline figure. There is also a Shiny app to explore some of the parameters to the &lt;code&gt;timeline&lt;/code&gt; function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;timelineShinyDemo()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or try the Shiny App from the &lt;a href='http://rstudio.com'&gt;RStudio Server&lt;/a&gt; at &lt;a href='http://spark.rstudio.com/jbryer/timeline/'&gt;http://spark.rstudio.com/jbryer/timeline/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can always download the latest development version using &lt;code&gt;devtools&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(devtools)
install_github(&amp;#39;timeline&amp;#39;,&amp;#39;jbryer&amp;#39;)&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-05-09/Version_0.9_of_timeiline_on_CRAN.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-05-08/Gamblers_Run_With_Shiny</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-05-08/Gamblers_Run_With_Shiny.html"/>
    <title>Gambler's Run With Shiny</title>
    <published>2013-05-08T00:00:00-07:00</published>
    <updated>2013-05-08T00:00:00-07:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;I finally had an opportunity to play with &lt;a href='http://rstudio.com/shiny'&gt;Shiny&lt;/a&gt;, and I am very impressed. I have created a &lt;a href='http://github.com/jbryer/ShinyApps'&gt;Github Project&lt;/a&gt; so head over there for the source code. There are a number of ways to distribute Shiny apps. If you are running R (and mostly likely you are if you are reading this), you can download and run Shiny apps using the &lt;code&gt;runApp&lt;/code&gt; (if already downloaded), &lt;code&gt;runGitHub&lt;/code&gt;, &lt;code&gt;runGist&lt;/code&gt;, or &lt;code&gt;runUrl&lt;/code&gt; functions. RStudio also make the &lt;a href='http://rstudio.github.io/shiny/tutorial/#deployment-web'&gt;Shiny Server&lt;/a&gt; available and you can also &lt;a href='https://rstudio.wufoo.com/forms/shiny-server-beta-program/'&gt;request an account&lt;/a&gt; on their servers. Also be sure to check out the excellent &lt;a href='http://rstudio.github.io/shiny/tutorial/'&gt;tutorial&lt;/a&gt; on Shiny.&lt;/p&gt;

&lt;p&gt;First, install the &lt;code&gt;shiny&lt;/code&gt; and &lt;code&gt;shinyIncubator&lt;/code&gt; (for the &lt;code&gt;ActionButton&lt;/code&gt;) packages, preferably the development versions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(devtools)
install_github(&amp;#39;shiny&amp;#39;, &amp;#39;rstudio&amp;#39;)
install_github(&amp;#39;shiny-incubator&amp;#39;, &amp;#39;rstudio&amp;#39;)
require(shiny)&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id='gamblers_run'&gt;Gambler&amp;#8217;s Run&lt;/h4&gt;

&lt;p&gt;This simple app lets you simulate a sequence of random events, for example coin flips, and plot the cumulative sum. It allows you to choose the odds of winning, the number of games to simulate, and the number of simulations to display simultaneously.&lt;/p&gt;
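
&lt;p&gt;The simulation behind this kind of app can be sketched in a few lines of base R. This is a minimal illustration rather than the app&amp;#8217;s actual source; the variable names and the +1/-1 payoff are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch (not the app source): one run of win/lose games.
set.seed(2013)   # arbitrary seed, only for reproducibility
odds &amp;lt;- 0.5      # assumed probability of winning a single game
games &amp;lt;- 100     # number of games to simulate
# Each game pays +1 for a win and -1 for a loss (assumed payoff).
outcomes &amp;lt;- sample(c(1, -1), games, replace=TRUE, prob=c(odds, 1 - odds))
run &amp;lt;- cumsum(outcomes)   # cumulative sum after each game
plot(run, type=&amp;#39;l&amp;#39;, xlab=&amp;#39;Game&amp;#39;, ylab=&amp;#39;Cumulative Sum&amp;#39;)&lt;/code&gt;&lt;/pre&gt;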

&lt;p&gt;&lt;img src='https://raw.github.com/jbryer/ShinyApps/master/screens/gambler.png' alt='Gambler Shiny App' /&gt;&lt;/p&gt;

&lt;p&gt;To run the app locally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shiny::runGitHub(&amp;#39;ShinyApps&amp;#39;, &amp;#39;jbryer&amp;#39;, subdir=&amp;#39;gambler&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or from the &lt;a href='http://spark.rstudio.com/jbryer/gambler'&gt;RStudio server&lt;/a&gt; (note that RStudio does not guarantee the server will always be up so this link may or may not work).&lt;/p&gt;

&lt;h4 id='lottery_tickets'&gt;Lottery Tickets&lt;/h4&gt;

&lt;p&gt;Similar to the &lt;code&gt;gambler&lt;/code&gt; app, this simulates buying a series of lottery tickets with varying odds of winning different amounts. Each previous run is saved and plotted in light grey to show how the current run compares to past runs.&lt;/p&gt;

&lt;p&gt;&lt;img src='https://raw.github.com/jbryer/ShinyApps/master/screens/lottery.png' alt='Lottery Tickets Shiny App' /&gt;&lt;/p&gt;

&lt;p&gt;To run the app locally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shiny::runGitHub(&amp;#39;ShinyApps&amp;#39;, &amp;#39;jbryer&amp;#39;, subdir=&amp;#39;lottery&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or from the &lt;a href='http://spark.rstudio.com/jbryer/lottery'&gt;RStudio server&lt;/a&gt; (note that RStudio does not guarantee the server will always be up so this link may or may not work).&lt;/p&gt;

&lt;p&gt;Just to try out all the ways to distribute Shiny apps, I also created a &lt;a href='https://gist.github.com/jbryer/5525690'&gt;Gist&lt;/a&gt; for this app.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shiny::runGist(&amp;quot;5525690&amp;quot;)&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-05-08/Gamblers_Run_With_Shiny.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-04-18/Cut_Dates_Into_Quarters</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-04-18/Cut_Dates_Into_Quarters.html"/>
    <title>Cut Dates Into Quarters</title>
    <published>2013-04-18T00:00:00-07:00</published>
    <updated>2013-04-18T00:00:00-07:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;Frequently I need to recode a date column to quarters. For example, at &lt;a href='http://www.excelsior.edu'&gt;Excelsior College&lt;/a&gt; we have continuous enrollment so we report new enrollments per quarter. To complicate things a bit, our fiscal year starts in July so that July, August, and September represent the first quarter, January, February, and March are actually the third quarter. But sometimes we do need need to report out based upon calendar years (i.e. where January is in the first quarter). I am sure this is pretty common practice in many disciplines. There are probably other ways to do this in R (please comment below about other methods), but could not find one that satisfies my needs.&lt;/p&gt;

&lt;p&gt;We can begin by &lt;code&gt;source&lt;/code&gt;ing the function from &lt;a href='https://gist.github.com/jbryer/5412193'&gt;Gist&lt;/a&gt; using the &lt;code&gt;devtools&lt;/code&gt; package.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(devtools)
source_gist(5412193)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Create a vector of &lt;code&gt;Dates&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; dates &amp;lt;- as.Date(c(&amp;#39;2013-04-03&amp;#39;,&amp;#39;2012-03-30&amp;#39;,&amp;#39;2011-10-31&amp;#39;,
                   &amp;#39;2011-04-14&amp;#39;,&amp;#39;2010-04-22&amp;#39;,&amp;#39;2004-10-04&amp;#39;,
                   &amp;#39;2000-02-29&amp;#39;,&amp;#39;1997-12-05&amp;#39;,&amp;#39;1997-04-23&amp;#39;,
                   &amp;#39;1997-04-01&amp;#39;))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The default is to use the typical academic fiscal year, with the year starting July 1.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; getYearQuarter(dates)
 [1] FY2013-Q4 FY2012-Q3 FY2012-Q2 FY2011-Q4 FY2010-Q4 FY2005-Q2 FY2000-Q3 FY1998-Q2 FY1997-Q4
[10] FY1997-Q4
65 Levels: FY1997-Q4 &amp;lt; FY1998-Q1 &amp;lt; FY1998-Q2 &amp;lt; FY1998-Q3 &amp;lt; FY1998-Q4 &amp;lt; FY1999-Q1 &amp;lt; ... &amp;lt; FY2013-Q4&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, it is easy to get quarters within a calendar year.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; getYearQuarter(dates, firstMonth=1)
 [1] FY2013-Q2 FY2012-Q1 FY2011-Q4 FY2011-Q2 FY2010-Q2 FY2004-Q4 FY2000-Q1 FY1997-Q4 FY1997-Q2
[10] FY1997-Q2
65 Levels: FY1997-Q2 &amp;lt; FY1997-Q3 &amp;lt; FY1997-Q4 &amp;lt; FY1998-Q1 &amp;lt; FY1998-Q2 &amp;lt; FY1998-Q3 &amp;lt; ... &amp;lt; FY2013-Q2&lt;/code&gt;&lt;/pre&gt;
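
&lt;p&gt;The quarter assignment comes down to modular arithmetic on the month number. Here is a minimal sketch of that step using a hypothetical helper, &lt;code&gt;quarterNumber&lt;/code&gt;, which is not part of the Gist (the Gist reaches the same grouping with &lt;code&gt;cut&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical helper (not part of the Gist): quarter number for a month.
quarterNumber &amp;lt;- function(month, firstMonth=7) {
	# Months since the start of the fiscal year (0 to 11), in groups of three.
	((month - firstMonth) %% 12) %/% 3 + 1
}
quarterNumber(c(7, 10, 1, 4))                # July, Oct, Jan, Apr: 1 2 3 4
quarterNumber(c(1, 4, 7, 10), firstMonth=1)  # calendar year: 1 2 3 4&lt;/code&gt;&lt;/pre&gt;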

&lt;p&gt;You can also alter the format of the levels using the &lt;code&gt;fy.prefix&lt;/code&gt;, &lt;code&gt;quarter.prefix&lt;/code&gt;, and &lt;code&gt;sep&lt;/code&gt; parameters.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; getYearQuarter(dates, 1, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;)
 [1] 20132 20121 20114 20112 20102 20044 20001 19974 19972 19972
65 Levels: 19972 &amp;lt; 19973 &amp;lt; 19974 &amp;lt; 19981 &amp;lt; 19982 &amp;lt; 19983 &amp;lt; 19984 &amp;lt; 19991 &amp;lt; 19992 &amp;lt; ... &amp;lt; 20132&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Lastly, the function by default will create a level for each quarter between the minimum and maximum dates in the date vector passed in. You can override the range for defining the levels with the &lt;code&gt;level.range&lt;/code&gt; parameter. If the specified range is smaller than the range of the passed in vector, the function will print a warning because values outside that range will be returned as &lt;code&gt;NA&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; getYearQuarter(dates, level.range=as.Date(c(&amp;#39;2010-01-01&amp;#39;,&amp;#39;2013-01-01&amp;#39;)))
 [1] &amp;lt;NA&amp;gt;      FY2012-Q3 FY2012-Q2 FY2011-Q4 FY2010-Q4 &amp;lt;NA&amp;gt;      &amp;lt;NA&amp;gt;      &amp;lt;NA&amp;gt;      &amp;lt;NA&amp;gt;     
[10] &amp;lt;NA&amp;gt;     
13 Levels: FY2010-Q3 &amp;lt; FY2010-Q4 &amp;lt; FY2011-Q1 &amp;lt; FY2011-Q2 &amp;lt; FY2011-Q3 &amp;lt; FY2011-Q4 &amp;lt; ... &amp;lt; FY2013-Q3
Warning message:
In getYearQuarter(dates, level.range = as.Date(c(&amp;quot;2010-01-01&amp;quot;, &amp;quot;2013-01-01&amp;quot;))) :
  The range of x is greater than level.range. Values outside level.range will be returned as NA.&lt;/code&gt;&lt;/pre&gt;
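
&lt;p&gt;The &lt;code&gt;NA&lt;/code&gt; values follow from how &lt;code&gt;factor&lt;/code&gt; treats values that are not among the supplied levels. A small sketch with made-up labels:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy quarter labels; only two of the three are declared as levels.
vals &amp;lt;- c(&amp;#39;FY2010-Q4&amp;#39;, &amp;#39;FY2011-Q1&amp;#39;, &amp;#39;FY2013-Q4&amp;#39;)
f &amp;lt;- factor(vals, levels=c(&amp;#39;FY2010-Q4&amp;#39;, &amp;#39;FY2011-Q1&amp;#39;), ordered=TRUE)
is.na(f)   # FALSE FALSE TRUE&lt;/code&gt;&lt;/pre&gt;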

&lt;p&gt;Here is a link to the &lt;a href='https://gist.github.com/jbryer/5412193'&gt;Gist&lt;/a&gt; or copy-and-paste from below.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#&amp;#39; Returns the year (fiscal or calendar) and quarter in which the date appears.
#&amp;#39; 
#&amp;#39; This function will cut the given date vector into quarters (i.e. three month
#&amp;#39; increments) and return an ordered factor with levels defined to be the quarters
#&amp;#39; between the minimum and maximum dates in the given vector. The levels, by
#&amp;#39; default, will be formatted as \code{FY2013-Q1}, however the \code{FY} and \code{Q}
#&amp;#39; can be changed using the \code{fy.prefix} and \code{quarter.prefix} parameters,
#&amp;#39; respectively.
#&amp;#39; 
#&amp;#39; @param x vector of type \code{\link{Date}}.
#&amp;#39; @param firstMonth the month corresponding to the first month of the fiscal year.
#&amp;#39;        Setting \code{firstMonth=1} is equivalent to calendar years.
#&amp;#39; @param fy.prefix the character string to paste before the year.
#&amp;#39; @param quarter.prefix the character string to paste before the quarter.
#&amp;#39; @param sep the separator between the year and quarter.
#&amp;#39; @param level.range the range to use for defining the levels in the returned
#&amp;#39;        factor.
#&amp;#39; @export
#&amp;#39; @examples
#&amp;#39; 	dates &amp;lt;- as.Date(c(&amp;#39;2013-04-03&amp;#39;,&amp;#39;2012-03-30&amp;#39;,&amp;#39;2011-10-31&amp;#39;,
#&amp;#39; 	                   &amp;#39;2011-04-14&amp;#39;,&amp;#39;2010-04-22&amp;#39;,&amp;#39;2004-10-04&amp;#39;,
#&amp;#39; 	                   &amp;#39;2000-02-29&amp;#39;,&amp;#39;1997-12-05&amp;#39;,&amp;#39;1997-04-23&amp;#39;,
#&amp;#39; 	                   &amp;#39;1997-04-01&amp;#39;))
#&amp;#39; 	getYearQuarter(dates)
#&amp;#39; 	getYearQuarter(dates, firstMonth=1)
#&amp;#39; 	getYearQuarter(dates, 1, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;)
#&amp;#39; 	\dontrun{
#&amp;#39; 	getYearQuarter(dates, level.range=as.Date(c(&amp;#39;2010-01-01&amp;#39;,&amp;#39;2013-01-01&amp;#39;)))
#&amp;#39; 	}
getYearQuarter &amp;lt;- function(x, 
					   firstMonth=7, 
					   fy.prefix=&amp;#39;FY&amp;#39;, 
					   quarter.prefix=&amp;#39;Q&amp;#39;,
					   sep=&amp;#39;-&amp;#39;,
					   level.range=c(min(x), max(x)) ) {
	if(level.range[1] &amp;gt; min(x) | level.range[2] &amp;lt; max(x)) {
		warning(paste0(&amp;#39;The range of x is greater than level.range. Values &amp;#39;,
					   &amp;#39;outside level.range will be returned as NA.&amp;#39;))
	}
	quarterString &amp;lt;- function(d) {
		year &amp;lt;- as.integer(format(d, format=&amp;#39;%Y&amp;#39;))
		month &amp;lt;- as.integer(format(d, format=&amp;#39;%m&amp;#39;))
		y &amp;lt;- ifelse(firstMonth &amp;gt; 1 &amp;amp; month &amp;gt;= firstMonth, year+1, year)  
		q &amp;lt;- cut( (month - firstMonth) %% 12, breaks=c(-Inf,2,5,8,Inf), 
		      labels=paste0(quarter.prefix, 1:4))
		return(paste0(fy.prefix, y, sep, q))
	}
	vals &amp;lt;- quarterString(x)
	levels &amp;lt;- unique(quarterString(seq(
		as.Date(format(level.range[1], &amp;#39;%Y-%m-01&amp;#39;)), 
		as.Date(format(level.range[2], &amp;#39;%Y-%m-28&amp;#39;)), by=&amp;#39;month&amp;#39;)))
	return(factor(vals, levels=levels, ordered=TRUE))
}&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-04-18/Cut_Dates_Into_Quarters.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-03-26/i_Before_e_Except_After_c</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-03-26/i_Before_e_Except_After_c.html"/>
    <title>i Before e Except After c</title>
    <published>2013-03-26T00:00:00-07:00</published>
    <updated>2013-03-26T00:00:00-07:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;When I went to school we were always taught the &amp;#8220;i before e, except after c&amp;#8221; rule for spelling. But how accurate is this rule? Kevin Marks tweeted today the following:&lt;/p&gt;
&lt;blockquote class='twitter-tweet'&gt;&lt;p&gt;»@&lt;a href='https://twitter.com/uberfacts'&gt;uberfacts&lt;/a&gt;: There are 923 words in the English language that break the “I before E” rule. Only 44 words actually follow that rule.« Science&lt;/p&gt;&amp;mdash; Kevin Marks (@kevinmarks) &lt;a href='https://twitter.com/kevinmarks/status/316329566878695425'&gt;March 25, 2013&lt;/a&gt;&lt;/blockquote&gt;
&lt;p&gt;I am not sure where he came up with that result, but it seems simple enough to verify. First, download an English language word list compiled by Kevin Atkinson and available at &lt;a href='http://wordlist.sourceforge.net/'&gt;SourceForge&lt;/a&gt; (I will use the Parts of Speech Database, or &lt;a href='https://github.com/jbryer/jbryer.github.com/raw/master/_posts/part-of-speech.txt'&gt;download my version from GitHub&lt;/a&gt;). I also create a data frame (from the README file), &lt;code&gt;partsOfSpeech&lt;/code&gt;, that maps the codes to descriptions that we will use later.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(ggplot2)
require(reshape)

partsOfSpeech &amp;lt;- as.data.frame(matrix(c(&amp;quot;N&amp;quot;, &amp;quot;Noun&amp;quot;, &amp;quot;P&amp;quot;, &amp;quot;Plural&amp;quot;, &amp;quot;h&amp;quot;, &amp;quot;Noun Phrase&amp;quot;, 
    &amp;quot;V&amp;quot;, &amp;quot;Verb (usu participle)&amp;quot;, &amp;quot;t&amp;quot;, &amp;quot;Verb (transitive)&amp;quot;, &amp;quot;i&amp;quot;, &amp;quot;Verb (intransitive)&amp;quot;, 
    &amp;quot;A&amp;quot;, &amp;quot;Adjective&amp;quot;, &amp;quot;v&amp;quot;, &amp;quot;Adverb&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;Conjunction&amp;quot;, &amp;quot;P&amp;quot;, &amp;quot;Preposition&amp;quot;, 
    &amp;quot;!&amp;quot;, &amp;quot;Interjection&amp;quot;, &amp;quot;r&amp;quot;, &amp;quot;Pronoun&amp;quot;, &amp;quot;D&amp;quot;, &amp;quot;Definite Article&amp;quot;, &amp;quot;I&amp;quot;, &amp;quot;Indefinite Article&amp;quot;, 
    &amp;quot;o&amp;quot;, &amp;quot;Nominative&amp;quot;), ncol = 2, byrow = TRUE), stringsAsFactors = FALSE)
names(partsOfSpeech) &amp;lt;- c(&amp;quot;Code&amp;quot;, &amp;quot;Description&amp;quot;)

words &amp;lt;- read.table(&amp;quot;part-of-speech.txt&amp;quot;, sep = &amp;quot;\t&amp;quot;, header = FALSE, quote = &amp;quot;&amp;quot;, 
    col.names = c(&amp;quot;Word&amp;quot;, &amp;quot;POS&amp;quot;), stringsAsFactors = FALSE)
nrow(words)

## [1] 295172&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The part-of-speech field is coded such that the letters before the &lt;code&gt;|&lt;/code&gt; character come from the original &lt;a href='http://en.wikipedia.org/wiki/Moby_Project'&gt;Moby database&lt;/a&gt; and the letters after the &lt;code&gt;|&lt;/code&gt; character come from &lt;a href='http://wordnet.princeton.edu/'&gt;WordNet&lt;/a&gt;. The first character corresponds to the primary classification. The following R code will split this field into two new variables, &lt;code&gt;Moby&lt;/code&gt; and &lt;code&gt;WordNet&lt;/code&gt;, and then extract the first character of &lt;code&gt;WordNet&lt;/code&gt; to create a &lt;code&gt;WordNetPrimary&lt;/code&gt; variable. We will use this classification later for plotting purposes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tmp &amp;lt;- lapply(words$POS, FUN = function(x) {
    x &amp;lt;- unlist(strsplit(x, &amp;quot;|&amp;quot;, fixed = TRUE))
    if (length(x) == 1) 
        return(c(NA, x[[1]])) else if (x[[1]] == &amp;quot;&amp;quot;) 
        return(c(NA, x[[2]])) else return(c(x[[1]], x[[2]]))
})
words$Moby &amp;lt;- sapply(tmp, function(x) x[1])
words$WordNet &amp;lt;- sapply(tmp, function(x) x[2])
words$WordNetPrimary &amp;lt;- substr(words$WordNet, 1, 1)
table(words$WordNetPrimary, useNA = &amp;quot;ifany&amp;quot;)

## 
##      !      A      C      D      h      i      N      p      P      r 
##    260  51914     54     60  71566   2239 119441   8506     99     85 
##      t      v      V   &amp;lt;NA&amp;gt; 
##  12399  13730  12124   2695&lt;/code&gt;&lt;/pre&gt;
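
&lt;p&gt;Because &lt;code&gt;|&lt;/code&gt; is a regular expression metacharacter, the split uses &lt;code&gt;fixed = TRUE&lt;/code&gt; to treat it literally. A small sketch on made-up codes (not taken from the word list) shows the pieces the function above works with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Made-up POS strings, for illustration only.
both &amp;lt;- strsplit(&amp;#39;Nt|N&amp;#39;, &amp;#39;|&amp;#39;, fixed=TRUE)[[1]]      # &amp;quot;Nt&amp;quot; &amp;quot;N&amp;quot;
noMoby &amp;lt;- strsplit(&amp;#39;|N&amp;#39;, &amp;#39;|&amp;#39;, fixed=TRUE)[[1]]    # &amp;quot;&amp;quot; &amp;quot;N&amp;quot;
primary &amp;lt;- substr(both[1], 1, 1)                   # &amp;quot;N&amp;quot;&lt;/code&gt;&lt;/pre&gt;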

&lt;p&gt;We use the &lt;code&gt;grep&lt;/code&gt; function to get four vectors representing all the &amp;#8220;ie&amp;#8221;, &amp;#8220;ei&amp;#8221;, &amp;#8220;cei&amp;#8221;, and &amp;#8220;cie&amp;#8221; words. We also print the number of each type of word and the percentage of all words this represents.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ie &amp;lt;- grep(&amp;quot;ie&amp;quot;, words$Word)
ei &amp;lt;- grep(&amp;quot;ei&amp;quot;, words$Word)
cei &amp;lt;- grep(&amp;quot;cei&amp;quot;, words$Word)
cie &amp;lt;- grep(&amp;quot;cie&amp;quot;, words$Word)

length(ie)

## [1] 10647

length(ie)/nrow(words) * 100

## [1] 3.607

length(ei)

## [1] 3542

length(ei)/nrow(words) * 100

## [1] 1.2

length(cei)

## [1] 202

length(cei)/nrow(words) * 100

## [1] 0.06843

length(cie)

## [1] 654

length(cie)/nrow(words) * 100

## [1] 0.2216&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Number of words that follow the rule, &amp;#8220;i before e except after c&amp;#8221;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;length(ie) + length(cei) - length(cie)

## [1] 10195&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Number of i after e words that are not after c (first way to break the rule).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;length(ei[!(ei %in% cei)])

## [1] 3340&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Number of i before e words that are after c (the other way to break the rule).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;length(cie)

## [1] 654&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Percentage of words that break the rule.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(length(ei[!(ei %in% cei)]) + length(cie))/sum(length(ie), length(ei)) * 100

## [1] 28.15&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;So of the 14,189 &amp;#8220;ie&amp;#8221; and &amp;#8220;ei&amp;#8221; words, 3,994 break the &amp;#8220;i before e, except after c&amp;#8221; rule, or about 28.1%.&lt;/strong&gt;&lt;/p&gt;
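
&lt;p&gt;To make the counting concrete, here is the same calculation on six hand-picked words. The word vector and the &lt;code&gt;toy&lt;/code&gt; prefixed names are only for illustration and are not part of the analysis.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Six hand-picked words, used only to illustrate the counting.
toy &amp;lt;- c(&amp;#39;believe&amp;#39;, &amp;#39;receive&amp;#39;, &amp;#39;science&amp;#39;, &amp;#39;weird&amp;#39;, &amp;#39;their&amp;#39;, &amp;#39;field&amp;#39;)
toy.ie &amp;lt;- grep(&amp;#39;ie&amp;#39;, toy)     # believe, science, field
toy.ei &amp;lt;- grep(&amp;#39;ei&amp;#39;, toy)     # receive, weird, their
toy.cei &amp;lt;- grep(&amp;#39;cei&amp;#39;, toy)   # receive
toy.cie &amp;lt;- grep(&amp;#39;cie&amp;#39;, toy)   # science
# Rule breakers: &amp;#39;ei&amp;#39; not after &amp;#39;c&amp;#39; (weird, their) plus &amp;#39;cie&amp;#39; (science).
toy.breakers &amp;lt;- length(toy.ei[!(toy.ei %in% toy.cei)]) + length(toy.cie)
toy.pct &amp;lt;- toy.breakers / sum(length(toy.ie), length(toy.ei)) * 100
toy.pct   # 50&lt;/code&gt;&lt;/pre&gt;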

&lt;p&gt;Let&amp;#8217;s see how this breaks out by part-of-speech.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;thewords &amp;lt;- words[c(ie, ei), ]
thewords$BreakRule &amp;lt;- TRUE
thewords[which(row.names(thewords) %in% c(cei, ie[!(ie %in% cie)])), ]$BreakRule &amp;lt;- FALSE

# Counts
tab &amp;lt;- as.data.frame(table(thewords$WordNetPrimary, thewords$BreakRule, useNA = &amp;quot;ifany&amp;quot;))
tab &amp;lt;- merge(tab, partsOfSpeech, by.x = &amp;quot;Var1&amp;quot;, by.y = &amp;quot;Code&amp;quot;, all.x = TRUE)

ggplot(tab, aes(x = Description, y = Freq, fill = Var2)) + geom_bar(stat = &amp;quot;identity&amp;quot;, 
    position = &amp;quot;dodge&amp;quot;) + ylab(&amp;quot;Number of Words&amp;quot;) + xlab(&amp;quot;Part of Speech&amp;quot;) + 
    scale_fill_hue(&amp;quot;Break the Rule&amp;quot;) + ggtitle(&amp;quot;i Before e, Except After c&amp;quot;) + 
    coord_flip()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/figure/IbeforeE1.png' alt='plot of chunk IbeforeE' /&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Percentages
tab2 &amp;lt;- as.data.frame(prop.table(table(thewords$WordNetPrimary, thewords$BreakRule, 
    useNA = &amp;quot;ifany&amp;quot;), 1) * 100)
tab2 &amp;lt;- merge(tab2, partsOfSpeech, by.x = &amp;quot;Var1&amp;quot;, by.y = &amp;quot;Code&amp;quot;, all.x = TRUE)
ggplot(tab2, aes(x = Description, y = Freq, fill = Var2)) + geom_bar(stat = &amp;quot;identity&amp;quot;, 
    position = &amp;quot;dodge&amp;quot;) + ylab(&amp;quot;Percentage of Words by Part of Speech&amp;quot;) + xlab(&amp;quot;Part of Speech&amp;quot;) + 
    scale_fill_hue(&amp;quot;Break the Rule&amp;quot;) + ggtitle(&amp;quot;i Before e, Except After c&amp;quot;) + 
    coord_flip()&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/figure/IbeforeE2.png' alt='plot of chunk IbeforeE' /&gt;&lt;/p&gt;
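
&lt;p&gt;The percentages in the second plot are row percentages: &lt;code&gt;prop.table&lt;/code&gt; with margin &lt;code&gt;1&lt;/code&gt; divides each cell by its row total. A small sketch with made-up codes and flags:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Made-up part-of-speech codes and rule-breaking flags.
pos &amp;lt;- c(&amp;#39;N&amp;#39;, &amp;#39;N&amp;#39;, &amp;#39;N&amp;#39;, &amp;#39;V&amp;#39;)
flag &amp;lt;- c(TRUE, FALSE, FALSE, TRUE)
pct &amp;lt;- prop.table(table(pos, flag), 1) * 100
pct[&amp;#39;N&amp;#39;, &amp;#39;TRUE&amp;#39;]   # one of three nouns breaks the rule, about 33.3
rowSums(pct)        # each row sums to 100&lt;/code&gt;&lt;/pre&gt;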

&lt;p&gt;A few last details. Here is the proportional table of words that break the rule by part-of-speech. Lastly, here are the &lt;em&gt;definite article&lt;/em&gt; and &lt;em&gt;pronoun&lt;/em&gt; words (three of each), all of which break the rule.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cast(tab2, Description ~ Var2, mean, value = &amp;quot;Freq&amp;quot;)

##              Description FALSE   TRUE
## 1              Adjective 83.17  16.83
## 2                 Adverb 66.56  33.44
## 3            Conjunction 25.00  75.00
## 4       Definite Article  0.00 100.00
## 5           Interjection 55.56  44.44
## 6                   Noun 65.84  34.16
## 7            Noun Phrase 63.70  36.30
## 8                Pronoun  0.00 100.00
## 9    Verb (intransitive) 54.55  45.45
## 10     Verb (transitive) 49.42  50.58
## 11 Verb (usu participle) 65.45  34.55
## 12                  &amp;lt;NA&amp;gt; 67.26  32.74

thewords[which(thewords$WordNetPrimary == &amp;quot;D&amp;quot;), ]

##           Word POS Moby WordNet WordNetPrimary BreakRule
## 113927  either DCv &amp;lt;NA&amp;gt;     DCv              D      TRUE
## 182679 neither DCv &amp;lt;NA&amp;gt;     DCv              D      TRUE
## 262111   their   D &amp;lt;NA&amp;gt;       D              D      TRUE

thewords[which(thewords$WordNetPrimary == &amp;quot;r&amp;quot;), ]

##               Word POS Moby WordNet WordNetPrimary BreakRule
## 262112      theirs   r &amp;lt;NA&amp;gt;       r              r      TRUE
## 262113   theirself   r &amp;lt;NA&amp;gt;       r              r      TRUE
## 262114 theirselves  rp &amp;lt;NA&amp;gt;      rp              r      TRUE&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='part_ii__using_only_the_5000_most_frequently_used_words'&gt;Part II - Using only the 5,000 Most Frequently Used Words&lt;/h3&gt;

&lt;p&gt;Here is an update using the list of the 5,000 most commonly used words from http://www.wordfrequency.info/top5000.asp (note there really are only 4,354 unique words, since the same word can be used as different parts-of-speech). Of the 4,354 unique words, 96, or about 2.2%, have an &amp;#8220;ie&amp;#8221; or &amp;#8220;ei&amp;#8221; in the word. Of those 96 words, 31, or 32.3%, break the &amp;#8220;i before e except after c&amp;#8221; rule.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;words &amp;lt;- read.csv(&amp;quot;MostUsedWords.csv&amp;quot;)
dups &amp;lt;- words[words$Word %in% words[duplicated(words$Word), ]$Word, ]
head(dups[order(dups$Word), ])

##      Rank  Word Part.of.speech Frequency Dispersion
## 47     46 about              i    874406       0.96
## 180   179 about              r    208550       0.97
## 897   896 above              i     44130       0.95
## 1604 1599 above              r     23866       0.92
## 1553 1548 abuse              n     24534       0.93
## 3783 3778 abuse              v      7554       0.94

length(unique(words$Word))

## [1] 4354

words &amp;lt;- words[!duplicated(words$Word), ]

ie &amp;lt;- grep(&amp;quot;ie&amp;quot;, words$Word)
ei &amp;lt;- grep(&amp;quot;ei&amp;quot;, words$Word)
cei &amp;lt;- grep(&amp;quot;cei&amp;quot;, words$Word)
cie &amp;lt;- grep(&amp;quot;cie&amp;quot;, words$Word)

# Percentage of words that break the rule.
(length(ei[!(ei %in% cei)]) + length(cie))/sum(length(ie), length(ei)) * 100

## [1] 32.29&lt;/code&gt;&lt;/pre&gt;
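
&lt;p&gt;Note that &lt;code&gt;!duplicated()&lt;/code&gt; keeps the first occurrence of each word and drops later ones. A small sketch on made-up rows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy rows in the same spirit as the word list (values are made up).
toy &amp;lt;- data.frame(Word=c(&amp;#39;about&amp;#39;, &amp;#39;about&amp;#39;, &amp;#39;abuse&amp;#39;),
                  Part.of.speech=c(&amp;#39;i&amp;#39;, &amp;#39;r&amp;#39;, &amp;#39;n&amp;#39;),
                  stringsAsFactors=FALSE)
duplicated(toy$Word)                      # FALSE TRUE FALSE
deduped &amp;lt;- toy[!duplicated(toy$Word), ]
deduped$Part.of.speech                    # &amp;quot;i&amp;quot; &amp;quot;n&amp;quot;&lt;/code&gt;&lt;/pre&gt;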

&lt;h3 id='part_iii__weighted_by_frequency_of_words'&gt;Part III - Weighted by Frequency of Words&lt;/h3&gt;

&lt;p&gt;Using the same list as Part II above, let&amp;#8217;s consider the word frequency. That is, we&amp;#8217;ll weight each word by its frequency according to WordFrequency.info. Using this approach, 47% of &amp;#8220;ie&amp;#8221; and &amp;#8220;ei&amp;#8221; words break the rule. Put another way, for each &amp;#8220;ie&amp;#8221; or &amp;#8220;ei&amp;#8221; word you encounter while reading, there is a 47% chance it does not follow the &amp;#8220;i before e, except after c&amp;#8221; rule.&lt;/p&gt;
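
&lt;p&gt;As a quick illustration of the weighting, here is the calculation on three words with made-up frequencies. The &lt;code&gt;toy&lt;/code&gt; names and the numbers are assumptions for the sketch only.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Made-up words and frequencies, for illustration only.
toy &amp;lt;- data.frame(Word=c(&amp;#39;believe&amp;#39;, &amp;#39;their&amp;#39;, &amp;#39;receive&amp;#39;),
                  Frequency=c(100, 300, 100), stringsAsFactors=FALSE)
toy.ie &amp;lt;- grep(&amp;#39;ie&amp;#39;, toy$Word)
toy.ei &amp;lt;- grep(&amp;#39;ei&amp;#39;, toy$Word)
toy.cei &amp;lt;- grep(&amp;#39;cei&amp;#39;, toy$Word)
toy.cie &amp;lt;- grep(&amp;#39;cie&amp;#39;, toy$Word)
# Frequency of rule breakers: only &amp;#39;their&amp;#39; (300) in this toy table.
toy.broken &amp;lt;- sum(toy[toy.ei[!(toy.ei %in% toy.cei)], &amp;#39;Frequency&amp;#39;]) +
    sum(toy[toy.cie, &amp;#39;Frequency&amp;#39;])
toy.weighted &amp;lt;- toy.broken / sum(toy[toy.ie, &amp;#39;Frequency&amp;#39;],
                                 toy[toy.ei, &amp;#39;Frequency&amp;#39;]) * 100
toy.weighted   # 60&lt;/code&gt;&lt;/pre&gt;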

&lt;pre&gt;&lt;code&gt;words &amp;lt;- read.csv(&amp;quot;MostUsedWords.csv&amp;quot;)
ie &amp;lt;- grep(&amp;quot;ie&amp;quot;, words$Word)
ei &amp;lt;- grep(&amp;quot;ei&amp;quot;, words$Word)
cei &amp;lt;- grep(&amp;quot;cei&amp;quot;, words$Word)
cie &amp;lt;- grep(&amp;quot;cie&amp;quot;, words$Word)
(sum(words[ei[!(ei %in% cei)], &amp;quot;Frequency&amp;quot;]) + sum(words[cie, &amp;quot;Frequency&amp;quot;]))/sum(words[ie, 
    &amp;quot;Frequency&amp;quot;], words[ei, &amp;quot;Frequency&amp;quot;]) * 100

## [1] 46.81&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-03-26/i_Before_e_Except_After_c.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-02-14/Version_1_multilevelPSA</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-02-14/Version_1_multilevelPSA.html"/>
    <title>Version 1.0 of multilevelPSA Available on CRAN</title>
    <published>2013-02-14T00:00:00-08:00</published>
    <updated>2013-02-14T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;Version 1.0 of &lt;code&gt;multilevelPSA&lt;/code&gt; has been released to CRAN. The &lt;code&gt;multilevelPSA&lt;/code&gt; package provides functions to estimate and visualize propensity score models with multilevel, or clustered, data. The graphics are an extension of &lt;a href='http://www.jstatsoft.org/v29/i06/paper'&gt;&lt;code&gt;PSAgraphics&lt;/code&gt;&lt;/a&gt; package by Helmreich and Pruzek. The example below will investigate the differences between private and public school internationally using the Programme of International Student Assessment (PISA). The &lt;code&gt;multilevelPSA&lt;/code&gt; package includes a subset of the full 2009 PISA dataset including North American countries (i.e. Canada, Mexico, &amp;amp; United States). However, the full dataset is available in the &lt;a href='/pisa'&gt;&lt;code&gt;pisa&lt;/code&gt;&lt;/a&gt; R package and can be downloaded using the &lt;code&gt;install_github&lt;/code&gt; function in the &lt;code&gt;devtools&lt;/code&gt; package (note that the package is approximately 80mb and as such, is not available on CRAN).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; install.packages(c(&amp;#39;multilevelPSA&amp;#39;,&amp;#39;devtools&amp;#39;), repos=&amp;#39;http://cran.r-project.org&amp;#39;)
&amp;gt; require(devtools)
&amp;gt; install_github(&amp;#39;pisa&amp;#39;,&amp;#39;jbryer&amp;#39;)
&amp;gt; require(multilevelPSA)
&amp;gt; require(pisa)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Load and set up the data. If the &lt;code&gt;pisa&lt;/code&gt; package is installed, we will load and subset from the full PISA dataset; otherwise we will use only the North American countries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; data(pisa.colnames) #Data catalog
&amp;gt; data(pisa.psa.cols) #Character vector listing the covariates we will use in phase I
&amp;gt; student &amp;lt;- NULL
if(require(pisa, quietly=TRUE)) {
	data(pisa.student)
	data(pisa.school)
	student = pisa.student[,c(&amp;#39;CNT&amp;#39;, &amp;#39;SCHOOLID&amp;#39;,
	                          paste0(&amp;#39;PV&amp;#39;, 1:5, &amp;#39;MATH&amp;#39;),
	                          paste0(&amp;#39;PV&amp;#39;, 1:5, &amp;#39;READ&amp;#39;),
	                          paste0(&amp;#39;PV&amp;#39;, 1:5, &amp;#39;SCIE&amp;#39;),
	                          pisa.psa.cols)]
	school = pisa.school[,c(&amp;#39;COUNTRY&amp;#39;, &amp;quot;CNT&amp;quot;, &amp;quot;SCHOOLID&amp;quot;,
	                        &amp;#39;SC02Q01&amp;#39;, #Public (1) or private (2)
	                        &amp;#39;STRATIO&amp;#39; #Student-teacher ratio 
	)]
	names(school) = c(&amp;#39;COUNTRY&amp;#39;, &amp;#39;CNT&amp;#39;, &amp;#39;SCHOOLID&amp;#39;, &amp;#39;PUBPRIV&amp;#39;, &amp;#39;STRATIO&amp;#39;)
	school$SCHOOLID = as.integer(school$SCHOOLID)
	school$CNT = as.character(school$CNT)
	student$SCHOOLID = as.integer(student$SCHOOLID)
	student$CNT = as.character(student$CNT)
	student = merge(student, school, by=c(&amp;#39;CNT&amp;#39;, &amp;#39;SCHOOLID&amp;#39;), all.x=TRUE)
	student = student[!is.na(student$PUBPRIV),] #Remove rows with missing PUBPRIV
	rm(pisa.student)
	rm(pisa.school)
} else {
	data(pisana)
	student = pisana
	rm(pisana)
}&lt;/code&gt;&lt;/pre&gt;
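
&lt;p&gt;The merge above is a left join (&lt;code&gt;all.x=TRUE&lt;/code&gt;) followed by dropping students whose school has no public/private code. A small sketch on made-up rows; the &lt;code&gt;.toy&lt;/code&gt; data frames are assumptions, with column names mirroring the ones above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Made-up student and school rows, for illustration only.
student.toy &amp;lt;- data.frame(CNT=c(&amp;#39;USA&amp;#39;, &amp;#39;USA&amp;#39;, &amp;#39;CAN&amp;#39;), SCHOOLID=c(1L, 2L, 1L),
                          stringsAsFactors=FALSE)
school.toy &amp;lt;- data.frame(CNT=c(&amp;#39;USA&amp;#39;, &amp;#39;CAN&amp;#39;), SCHOOLID=c(1L, 1L),
                         PUBPRIV=c(&amp;#39;Public&amp;#39;, &amp;#39;Private&amp;#39;), stringsAsFactors=FALSE)
merged &amp;lt;- merge(student.toy, school.toy, by=c(&amp;#39;CNT&amp;#39;, &amp;#39;SCHOOLID&amp;#39;), all.x=TRUE)
missing.toy &amp;lt;- sum(is.na(merged$PUBPRIV))   # 1 (USA school 2 has no match)
merged &amp;lt;- merged[!is.na(merged$PUBPRIV), ]  # keep matched students only
nrow(merged)                                # 2&lt;/code&gt;&lt;/pre&gt;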

&lt;h4 id='phase_i'&gt;Phase I&lt;/h4&gt;

&lt;p&gt;Propensity score analysis is generally conducted in two phases. In phase one, the dependent measure of interest is treatment placement. In this example, we will consider attending private school to be the treatment. There are a variety of methods we can use including logistic regression (see the &lt;code&gt;mlpsa.logistic&lt;/code&gt; function) and classification trees (see the &lt;code&gt;mlpsa.ctree&lt;/code&gt; function). In this example we will use the &lt;code&gt;ctree&lt;/code&gt; function in the &lt;code&gt;party&lt;/code&gt; package to model private school attendance using the conditional inference tree framework.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; mlctree = mlpsa.ctree(student[,c(&amp;#39;CNT&amp;#39;,&amp;#39;PUBPRIV&amp;#39;,pisa.psa.cols)], 
                        formula=PUBPRIV ~ ., level2=&amp;#39;CNT&amp;#39;)
&amp;gt; student.party = getStrata(mlctree, student, level2=&amp;#39;CNT&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;mlpsa.ctree&lt;/code&gt; function estimates a separate model for each level. As a result, different sets of covariates are likely to be used within each level. The &lt;code&gt;tree.plot&lt;/code&gt; function creates a heat map of covariate use by level. The shading indicates the shallowest depth at which each covariate appears. That is, if a covariate is used more than once within a tree, the smallest depth determines the shading.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; tree.plot(mlctree, level2Col=student$CNT)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/multilevelPSA/pisatree.png' alt='Multilevel PSA Tree Plot' /&gt;&lt;/p&gt;

&lt;h4 id='phase_ii'&gt;Phase II&lt;/h4&gt;

&lt;p&gt;Phase two involves comparing students between the two groups with &lt;em&gt;similar&lt;/em&gt; covariate profiles. In the case of classification trees, we consider students in the same leaf node to have sufficiently similar covariate profiles. However, it is important to verify that sufficient covariate balance has been achieved, and the functions in the &lt;code&gt;PSAgraphics&lt;/code&gt; package can assist with that.&lt;/p&gt;

&lt;p&gt;First, we will calculate a mean math score from the five plausible values provided. This is not entirely correct: final tabular results should use each plausible value separately and then combine the results into a pooled estimate (see the &lt;a href='http://faculty.washington.edu/tlumley/survey/'&gt;&lt;code&gt;survey&lt;/code&gt;&lt;/a&gt; package, for example). However, for visualization purposes the mean value will provide a sufficient estimate of the final results.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; student.party$mathscore = apply(student.party[,paste0(&amp;#39;PV&amp;#39;, 1:5, &amp;#39;MATH&amp;#39;)], 1, sum) / 5

&amp;gt; results.psa.math = mlpsa(response=student.party$mathscore, 
                           treatment=student.party$PUBPRIV, 
                           strata=student.party$strata, 
                           level2=student.party$CNT, minN=5)

 &amp;gt; summary(results.psa.math)
 Multilevel PSA Model of 694 strata for 64 levels.
 Approx t: 3.17
 Confidence Interval: 15.71, 18.41&lt;/code&gt;&lt;/pre&gt;
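
&lt;p&gt;As a rough sketch of the pooling mentioned above (not something the code in this post does), estimates computed separately for each plausible value are commonly combined by averaging them and adding the between-value variance to the average sampling variance. The &lt;code&gt;pool_pv&lt;/code&gt; function and the numbers passed to it are hypothetical and only illustrate the arithmetic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical helper: pool one estimate per plausible value.
pool_pv = function(est, se) {
	M = length(est)
	within = mean(se^2)   # average sampling variance
	between = var(est)    # variance of the estimates across plausible values
	c(estimate=mean(est), se=sqrt(within + (1 + 1/M) * between))
}
# Made-up estimates and standard errors, for illustration only
pool_pv(est=c(16.8, 17.2, 17.0, 17.5, 16.9), se=c(0.70, 0.68, 0.71, 0.69, 0.70))&lt;/code&gt;&lt;/pre&gt;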

&lt;p&gt;The &lt;code&gt;mlpsa&lt;/code&gt; function returns an object of class type &lt;code&gt;mlpsa&lt;/code&gt;. The S3 methods for &lt;code&gt;summary&lt;/code&gt;, &lt;code&gt;print&lt;/code&gt;, &lt;code&gt;plot&lt;/code&gt;, and &lt;code&gt;xtable&lt;/code&gt; are implemented. Additionally, the returned object contains the following elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;approx.t&lt;/code&gt; The approximate overall &lt;em&gt;t&lt;/em&gt;-value.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;level1.summary&lt;/code&gt; A data frame containing the results for each individual stratum.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;level2.summary&lt;/code&gt; A data frame with the overall results for the clustering level two variable (country in this example).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.ci&lt;/code&gt; A numeric vector with two values for the overall confidence interval.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.mnx&lt;/code&gt; The overall adjusted mean for the control group (i.e. public schools).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.mny&lt;/code&gt; The overall adjusted mean for the treatment group (i.e. private schools).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.n&lt;/code&gt; The overall &lt;em&gt;n&lt;/em&gt;.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.nx&lt;/code&gt; The overall &lt;em&gt;n&lt;/em&gt; for the control group (i.e. public schools).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.ny&lt;/code&gt; The overall &lt;em&gt;n&lt;/em&gt; for the treatment group (i.e. private schools).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.se.wtd&lt;/code&gt; The overall weighted standard error of the mean difference.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;overall.wtd&lt;/code&gt; The overall weighted difference.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;removed&lt;/code&gt; An integer vector with the positions of rows from the original data frame that were removed due to strata that were too small (the default minimum stratum size is five, see the &lt;code&gt;minN&lt;/code&gt; parameter of the &lt;code&gt;mlpsa&lt;/code&gt; function).&lt;/li&gt;

&lt;li&gt;&lt;code&gt;unweighted.summary&lt;/code&gt; Data frame containing the overall unadjusted means for the treatment and control groups.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other fields not listed above are used for plotting purposes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; &amp;gt; results.psa.math$level2.summary[,c(&amp;#39;level2&amp;#39;,&amp;#39;n&amp;#39;,&amp;#39;Private&amp;#39;,&amp;#39;Private.n&amp;#39;,&amp;#39;Public&amp;#39;,&amp;#39;Public.n&amp;#39;,
                                      &amp;#39;diffwtd&amp;#39;,&amp;#39;ci.min&amp;#39;,&amp;#39;ci.max&amp;#39;,&amp;#39;df&amp;#39;)]

    level2     n  Private Private.n   Public Public.n      diffwtd     ci.min     ci.max    df
 1     ALB  4596 420.8655       395 373.4865     4201  47.37901925  37.112435  57.645604  4578
 2     ARG  4707 409.9844      1535 381.7359     3172  28.24847947  19.860348  36.636611  4681
 3     AUS 14251 525.7220      5536 497.6153     8715  28.10669008  23.477031  32.736349 14193
 4     AUT  6405 494.3873       842 500.5522     5563  -6.16487922 -16.364671   4.034913  6377
 5     AZE  2520 489.5733        80 437.1517     2440  52.42154979  41.202358  63.640742  2514
 6     BEL  8488 530.4688      5830 490.3529     2658  40.11586405  32.093483  48.138245  8460
 7     BGR  4488 510.5927        59 426.6689     4429  83.92386223  64.984555 102.863169  4480
 8     BRA 19112 420.6211      2265 370.9140    16847  49.70712118  44.353349  55.060893 19032
 9     CAN 23035 574.6481      1609 512.7957    21426  61.85247013  54.140194  69.564746 22981
 10    CHE 11645 538.9332       383 529.5514    11262   9.38179653  -4.692227  23.455820 11631
 11    CHL  5161 432.2037      3022 414.2283     2139  17.97548415   8.280336  27.670632  5135
 12    COL  7695 410.2222      1448 381.9332     6247  28.28900773  22.479379  34.098637  7651
 13    CZE  5751 513.4700       270 510.6293     5481   2.84071242  -8.075300  13.756725  5745
 14    DEU  4555 529.8054       241 510.5953     4314  19.21013318   3.206619  35.213647  4545
 15    DNK  5839 500.7247      1041 486.4226     4798  14.30205768   5.809685  22.794430  5825
 16    ESP 25363 501.6931     10034 484.6639    15329  17.02913971  13.582441  20.475839 25295
 17    EST  4727 514.0800       127 513.8796     4600   0.20041004 -14.952982  15.353802  4723
 18    FIN  5755 533.7971       279 539.0404     5476  -5.24326787 -18.901244   8.414708  5743
 19    GBR  8202 538.5366       439 501.2581     7763  37.27846710  27.933956  46.622978  8176
 20    GRC  4665 489.9827       321 468.1922     4344  21.79050984  10.808237  32.772783  4643
 21    HKG  4804 552.7403      4486 591.3133      318 -38.57307922 -50.860842 -26.285316  4800
 22    HRV  3059 468.1499        66 468.1106     2993   0.03929578 -19.559319  19.637911  3051
 23    HUN  4583 501.5931       539 494.2722     4044   7.32087259  -4.284299  18.926044  4571
 24    IDN  5136 362.4735      2375 381.7147     2761 -19.24121312 -23.520458 -14.961968  5118
 25    IRL  3928 493.7999      2462 478.9441     1466  14.85584371   3.175379  26.536309  3916
 26    ISL  3207 538.7074        17 505.9170     3190  32.79040612  -7.925951  73.506763  3205
 27    ISR  5607 471.0903       977 445.0270     4630  26.06329628  15.921608  36.204985  5585
 28    ITA 30234 464.5795      1641 491.9211    28593 -27.34161124 -37.826591 -16.856631 30208
 29    JOR  6439 419.7438       890 389.4549     5549  30.28893975  20.712494  39.865385  6411
 30    JPN  6088 521.8295      1672 532.6987     4416 -10.86927112 -20.465999  -1.272544  6070
 31    KAZ  3688 439.8842       140 415.6892     3548  24.19497667   5.874004  42.515949  3670
 32    KGZ  4128 418.3577       111 334.0767     4017  84.28102627  70.077098  98.484955  4118
 33    KOR  4989 549.2523      1898 548.4127     3091   0.83959029  -7.121606   8.800787  4961
 34    LIE   329 523.4837        18 536.4461      311 -12.96248124 -58.831205  32.906242   327
 35    LTU  4500 471.7728        44 476.7377     4456  -4.96495842 -26.761449  16.831532  4498
 36    LUX  4613 498.1891       666 493.4815     3947   4.70766110  -4.990032  14.405354  4599
 37    LVA  4343 469.8542        30 488.0522     4313 -18.19794867 -49.563973  13.168076  4339
 38    MAC  5628 527.8113      5392 480.8967      236  46.91466591  34.677379  59.151953  5618
 39    MEX 38124 430.2105      4044 423.0054    34080   7.20509210   4.038334  10.371850 38038
 40    MNE    41 386.5525        13 369.8808       28  16.67167582 -19.164385  52.507737    39
 41    NLD  4667 531.9771      2872 535.8870     1795  -3.90982011 -18.107778  10.288137  4657
 42    NOR  4353 462.8531        49 499.1450     4304 -36.29191402 -59.376655 -13.207173  4345
 43    NZL  4643 573.8157       242 519.8993     4401  53.91640732  44.564926  63.267889  4631
 44    PAN  3608 409.4171       919 344.0597     2689  65.35745641  56.961259  73.753654  3568
 45    PER  5985 387.2319      1155 357.5425     4830  29.68934040  19.766241  39.612440  5941
 46    POL  4803 522.3807       328 496.4239     4475  25.95684445  14.719257  37.194432  4787
 47    PRT  6298 503.2090       682 483.6007     5616  19.60823945  11.927275  27.289204  6286
 48    QAR  4435 444.2281      3281 384.2296     1154  59.99852251  51.270428  68.726617  4371
 49    QAT  7856 410.2191      2244 344.6134     5612  65.60569937  58.660056  72.551343  7742
 50    QCN  4966 627.8917       512 597.5703     4454  30.32137741  15.221133  45.421622  4954
 51    ROU  1017 370.9295        15 387.7845     1002 -16.85494354 -46.227241  12.517354  1013
 52    RUS  5308 469.4605        12 469.5508     5296  -0.09026699 -42.681943  42.501409  5306
 53    SGP  4981 524.9804       126 557.3697     4855 -32.38926291 -53.436208 -11.342317  4973
 54    SRB  5353 410.7185        53 442.6754     5300 -31.95684080 -57.341095  -6.572587  5347
 55    SVK  4555 516.3498       343 494.3370     4212  22.01279817   6.597103  37.428493  4547
 56    SVN  6155 566.1863       129 477.3868     6026  88.79944245  73.648893 103.949992  6145
 57    SWE  4567 516.5581       543 492.3527     4024  24.20539406  14.103703  34.307085  4555
 58    TAP  5785 518.3554      2225 566.2069     3560 -47.85153187 -54.453618 -41.249446  5759
 59    THA  6209 406.9979       800 429.9237     5409 -22.92581091 -35.658631 -10.192991  6197
 60    TTO  4604 402.7190       815 420.4209     3789 -17.70190852 -25.938477  -9.465340  4594
 61    TUN  2414 301.4003        95 377.2423     2319 -75.84192626 -88.562235 -63.121618  2404
 62    TUR   125 578.1132        17 515.9970      108  62.11617396  32.725739  91.506609   121
 63    URY  5462 469.0636      1018 416.3180     4444  52.74565867  44.245235  61.246083  5414
 64    USA  5233 504.5025       345 484.7964     4888  19.70610175   7.267340  32.144863  5215&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;em&gt;Multilevel PSA Assessment Plot&lt;/em&gt; provides a detailed visualization of the results. The default &lt;code&gt;plot&lt;/code&gt; method consists of three panels (each panel can be plotted separately, see the commands below). The panels to the left and bottom represent the means for private and public schools, respectively. Grey dots correspond to individual strata and the colored dots to the overall mean for that country. The main plot is a scatter plot with the overall public school mean on the &lt;em&gt;x&lt;/em&gt;-axis and the overall private school mean on the &lt;em&gt;y&lt;/em&gt;-axis. The size of each dot is proportional to the number of students sampled within that country. Rug plots on the right and top represent the distribution of scores. For each point, a line is projected parallel to the unit line (i.e. &lt;em&gt;y=x&lt;/em&gt;) to another line perpendicular to the unit line. The tick marks along that line correspond to the distribution of differences. That is, the distance of each tick mark to the unit line is equal to the difference between private and public school scores for that country (a separate plot for differences is provided below). Therefore, the unit line indicates no difference: points that lie above the line indicate a difference favoring private schools, and points that lie below the line indicate a difference favoring public schools. The dashed blue line parallel to the unit line represents the overall mean difference across all countries, and the dashed green line represents the confidence interval of that difference. In this example, given the relatively large &lt;em&gt;n&lt;/em&gt;, the confidence interval is so narrow that it practically overlaps the mean.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; plot(results.psa.math)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/multilevelPSA/pisaAssessmentPlot.png' alt='Multilevel PSA Assessment Plot' /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href='/images/multilevelPSA/pisaAssessmentPlotLarge.png'&gt;Click here&lt;/a&gt; for a larger version of the PSA Assessment Plot.&lt;/p&gt;

&lt;p&gt;From this figure, we can conclude that there is a small, statistically significant, effect in favor of private schools. However, for many countries, that difference is small as exemplified by the fact that many of the points cluster around the unit line. And there are some countries where public schools outperform private schools. The largest of these is Tunisia, although the overall performance of that country is also the lowest when adjusted for private or public school attendance.&lt;/p&gt;

&lt;p&gt;The difference plot below provides more detail with regard to the distribution of differences. In this plot, the grey points correspond to the difference for each stratum (i.e. the leaf nodes from phase I above in this example). The blue dots correspond to the overall difference for each country and, as above, their size is proportional to the number of students sampled within each country. However, this figure also includes confidence intervals for each country as well as overall. Since the standard deviation was specified via the &lt;code&gt;sd&lt;/code&gt; parameter, the scale of the &lt;em&gt;x&lt;/em&gt;-axis is in standardized units.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; mlpsa.difference.plot(results.psa.math, sd=mean(student.party$mathscore, na.rm=TRUE))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/multilevelPSA/pisaDiffPlot.png' alt='Multilevel Difference Plot' /&gt;&lt;/p&gt;

&lt;p&gt;You can plot the individual parts of the Multilevel PSA Assessment plot with the following functions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; mlpsa.circ.plot(results.psa.math, legendlab=FALSE)
&amp;gt; mlpsa.distribution.plot(results.psa.math, &amp;#39;Public&amp;#39;)
&amp;gt; mlpsa.distribution.plot(results.psa.math, &amp;#39;Private&amp;#39;)&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-02-14/Version_1_multilevelPSA.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-01-30/Converting_a_list_to_a_data_frame</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-01-30/Converting_a_list_to_a_data_frame.html"/>
    <title>Converting a list to a data frame</title>
    <published>2013-01-30T00:00:00-08:00</published>
    <updated>2013-01-30T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;There are many situations in R where you have a &lt;code&gt;list&lt;/code&gt; of &lt;code&gt;vector&lt;/code&gt;s that you need to convert to a &lt;code&gt;data.frame&lt;/code&gt;. This question has been addressed over at &lt;a href='http://stackoverflow.com/questions/4227223/r-list-to-data-frame'&gt;StackOverflow&lt;/a&gt; and it turns out there are many different approaches to completing this task. Since I encounter this situation relatively frequently, I wanted my own S3 method for &lt;code&gt;as.data.frame&lt;/code&gt; that takes a &lt;code&gt;list&lt;/code&gt; as its parameter. I should note that it only works with atomic vectors (i.e. logical, integer, numeric, complex, character and raw). If any one of the elements in the &lt;code&gt;list&lt;/code&gt; is of some other class type, the function will call &lt;code&gt;NextMethod&lt;/code&gt;. However, on my R instance at least, this will end up calling &lt;code&gt;as.data.frame.default&lt;/code&gt;, which will in turn throw an error.&lt;/p&gt;

&lt;p&gt;To use the function, you can source it directly from Gist using the &lt;code&gt;source_gist&lt;/code&gt; function in the &lt;code&gt;devtools&lt;/code&gt; package.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(devtools)
source_gist(4676064)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or you can download the code at &lt;a href='https://gist.github.com/4676064'&gt;https://gist.github.com/4676064&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id='example_one'&gt;Example One&lt;/h4&gt;

&lt;p&gt;In this first example we have a list with two vectors, each with the same length and the same names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test1 &amp;lt;- list( c(a=&amp;#39;a&amp;#39;,b=&amp;#39;b&amp;#39;,c=&amp;#39;c&amp;#39;), c(a=&amp;#39;d&amp;#39;,b=&amp;#39;e&amp;#39;,c=&amp;#39;f&amp;#39;))
&amp;gt; as.data.frame(test1)
  a b c
1 a b c
2 d e f&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id='example_two'&gt;Example Two&lt;/h4&gt;

&lt;p&gt;In this example we have a list of two vectors, same length, but only one has names. The function in this case will use the names from the first vector with names for the column names of the data frame.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test2 &amp;lt;- list( c(&amp;#39;a&amp;#39;,&amp;#39;b&amp;#39;,&amp;#39;c&amp;#39;), c(a=&amp;#39;d&amp;#39;,b=&amp;#39;e&amp;#39;,c=&amp;#39;f&amp;#39;))
&amp;gt; as.data.frame(test2)
  a b c
1 a b c
2 d e f&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id='example_three'&gt;Example Three&lt;/h4&gt;

&lt;p&gt;This example has two named vectors, but they have only one overlapping named element.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test3 &amp;lt;- list(&amp;#39;Row1&amp;#39;=c(a=&amp;#39;a&amp;#39;,b=&amp;#39;b&amp;#39;,c=&amp;#39;c&amp;#39;), &amp;#39;Row2&amp;#39;=c(a=&amp;#39;d&amp;#39;,var2=&amp;#39;e&amp;#39;,var3=&amp;#39;f&amp;#39;))
&amp;gt; as.data.frame(test3)
     a    b    c var2 var3
Row1 a    b    c &amp;lt;NA&amp;gt; &amp;lt;NA&amp;gt;
Row2 d &amp;lt;NA&amp;gt; &amp;lt;NA&amp;gt;    e    f&lt;/code&gt;&lt;/pre&gt;
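
&lt;p&gt;The alignment by name shown in this example can be sketched in a few lines of base R. This is not the code from the Gist; &lt;code&gt;list_to_df&lt;/code&gt; is a hypothetical helper that assumes every vector in the list is a named character vector.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;list_to_df &amp;lt;- function(x) {
	cols &amp;lt;- unique(unlist(lapply(x, names)))        # union of all element names
	rows &amp;lt;- lapply(x, function(v) unname(v[cols]))  # names a vector lacks become NA
	df &amp;lt;- as.data.frame(do.call(rbind, rows), stringsAsFactors=FALSE)
	names(df) &amp;lt;- cols
	df
}
list_to_df(list(&amp;#39;Row1&amp;#39;=c(a=&amp;#39;a&amp;#39;,b=&amp;#39;b&amp;#39;,c=&amp;#39;c&amp;#39;), &amp;#39;Row2&amp;#39;=c(a=&amp;#39;d&amp;#39;,var2=&amp;#39;e&amp;#39;,var3=&amp;#39;f&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;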

&lt;h4 id='example_four'&gt;Example Four&lt;/h4&gt;

&lt;p&gt;This is an example of what to avoid: three unnamed vectors that do not all have the same length. The number of columns in the resulting data frame will be equal to the length of the longest vector. For shorter vectors, &lt;code&gt;NA&lt;/code&gt;s will be filled in on the rightmost columns. This method will also print a warning.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test4 &amp;lt;- list(&amp;#39;Row1&amp;#39;=letters[1:5], &amp;#39;Row2&amp;#39;=letters[1:7], &amp;#39;Row3&amp;#39;=letters[8:14])
&amp;gt; as.data.frame(test4)
     Col1 Col2 Col3 Col4 Col5 Col6 Col7
Row1    a    b    c    d    e &amp;lt;NA&amp;gt; &amp;lt;NA&amp;gt;
Row2    a    b    c    d    e    f    g
Row3    h    i    j    k    l    m    n
Warning message:
In as.data.frame.list(test4) :
  The length of vectors are not the same and do not are not named, the results may not be correct.&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id='example_five'&gt;Example Five&lt;/h4&gt;

&lt;p&gt;Another example of equal length vectors.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test5 &amp;lt;- list(letters[1:10], letters[11:20])
&amp;gt; as.data.frame(test5)
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  a  b  c  d  e  f  g  h  i   j
2  k  l  m  n  o  p  q  r  s   t&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id='example_six'&gt;Example Six&lt;/h4&gt;

&lt;p&gt;This example shows the warning (and likely error too) that occurs when not all of the elements of the list are atomic vectors.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; test6 &amp;lt;- list(list(letters), letters)
&amp;gt; as.data.frame(test6)
Error in as.data.frame.default(test6, row.names = NULL, optional = FALSE) : 
  cannot coerce class &amp;#39;&amp;quot;list&amp;quot;&amp;#39; into a data.frame
In addition: Warning message:
In as.data.frame.list(test6) : All elements of the list must be a vector.&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-01-30/Converting_a_list_to_a_data_frame.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html"/>
    <title>Comparing two data frames with different number of rows</title>
    <published>2013-01-24T00:00:00-08:00</published>
    <updated>2013-01-24T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;I posted a question over on &lt;a href='http://stackoverflow.com/questions/14485040/is-there-an-efficient-way-of-comparing-two-data-frames'&gt;StackOverflow&lt;/a&gt; about an efficient way of comparing two data frames with the same column structure, but with different numbers of rows. What I would like to end up with is an &lt;em&gt;n&lt;/em&gt; x &lt;em&gt;m&lt;/em&gt; logical matrix where &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;m&lt;/em&gt; are the number of rows in the first and second data frames, respectively, and the value at the &lt;em&gt;i&lt;/em&gt;th row and &lt;em&gt;j&lt;/em&gt;th column indicates whether all the values in row &lt;em&gt;i&lt;/em&gt; of data frame one are equal to those in row &lt;em&gt;j&lt;/em&gt; of data frame two. To provide some context, this will be used in a propensity score matching algorithm to identify candidate matches that match exactly on any number of covariates. In addition to the approaches I had, &lt;a href='http://stackoverflow.com/users/324364/joran'&gt;joran&lt;/a&gt; provided an approach using the &lt;code&gt;Vectorize&lt;/code&gt; function (thanks again as I learned another nice function). I decided to put three approaches to a race&amp;#8230;&lt;/p&gt;

&lt;p&gt;To understand what I need, I&amp;#8217;ll start with a small example with two data frames, one with 4 rows, the other with 3, each with two variables, one logical and the other numeric. As an aside, I only need this to work for integer, factor, character, and logical types, which avoids the issues of comparing numerics.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; df1 &amp;lt;- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
&amp;gt; df2 &amp;lt;- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
&amp;gt; df1
   var1 var2
1  TRUE    1
2  TRUE    2
3 FALSE    3
4 FALSE    4
&amp;gt; df2
   var1 var2
5 FALSE    5
6  TRUE    2
7 FALSE    3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;First, let&amp;#8217;s consider the case when there is only one variable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	df3 &amp;lt;- sapply(df2$var1, FUN=function(x) { x == df1$var1 })
+ 	dimnames(df3) &amp;lt;- list(row.names(df1), row.names(df2))
+ })
   user  system elapsed 
      0       0       0 
&amp;gt; df3
      5     6     7
1 FALSE  TRUE FALSE
2 FALSE  TRUE FALSE
3  TRUE FALSE  TRUE
4  TRUE FALSE  TRUE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is pretty straightforward. Now I want the same type of result, but comparing more than one column (in the final implementation I need to handle any number of columns, so it is not necessarily limited to one or two).&lt;/p&gt;

&lt;p&gt;The first approach uses nested apply functions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	m1 &amp;lt;- t(as.matrix(df1))
+ 	m2 &amp;lt;- as.matrix(df2)
+ 	df4 &amp;lt;- apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x == y) } ) })
+ })
   user  system elapsed 
  0.001   0.000   0.001 
&amp;gt; df4
      5     6     7
1 FALSE FALSE FALSE
2 FALSE  TRUE FALSE
3 FALSE FALSE  TRUE
4 FALSE FALSE FALSE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Secondly, using the &lt;code&gt;Vectorize&lt;/code&gt; and &lt;code&gt;outer&lt;/code&gt; functions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	foo &amp;lt;- Vectorize(function(x,y) { all(df1[x,] == df2[y,]) })
+ 	df5 &amp;lt;- outer(1:nrow(df1), 1:nrow(df2), FUN=foo)
+ })
   user  system elapsed 
  0.005   0.000   0.006 
&amp;gt; df5
      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Lastly, we&amp;#8217;ll create a new character vector by pasting the other variables together.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	df1$var3 &amp;lt;- apply(df1, 1, paste, collapse=&amp;#39;.&amp;#39;)
+ 	df2$var3 &amp;lt;- apply(df2, 1, paste, collapse=&amp;#39;.&amp;#39;)
+ 	df6 &amp;lt;- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ 	dimnames(df6) &amp;lt;- list(row.names(df1), row.names(df2))
+ })
   user  system elapsed 
  0.000   0.000   0.001 
&amp;gt; df6
      5     6     7
1 FALSE FALSE FALSE
2 FALSE  TRUE FALSE
3 FALSE FALSE  TRUE
4 FALSE FALSE FALSE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can already see with this small example that the &lt;code&gt;Vectorize&lt;/code&gt; approach is the slowest. However, let&amp;#8217;s try a larger example. First we&amp;#8217;ll create two data frames, one with 1,000 rows and the second with 1,500. The resulting matrix will be 1,000 x 1,500.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;set.seed(2112)
df1 &amp;lt;- data.frame(row.names=1:1000, 
				  var1=sample(c(TRUE,FALSE), 1000, replace=TRUE), 
				  var2=sample(1:10, 1000, replace=TRUE) )
df2 &amp;lt;- data.frame(row.names=1001:2500, 
				  var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
				  var2=sample(1:10, 1500, replace=TRUE))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nested &lt;code&gt;apply&lt;/code&gt; functions approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	m1 &amp;lt;- t(as.matrix(df1))
+ 	m2 &amp;lt;- as.matrix(df2)
+ 	df4 &amp;lt;- apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x == y) } ) })
+ })
   user  system elapsed 
 10.807   0.043  11.096 &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;Vectorize&lt;/code&gt; approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	foo &amp;lt;- Vectorize(function(x,y) { all(df1[x,] == df2[y,]) })
+ 	df5 &amp;lt;- outer(1:nrow(df1), 1:nrow(df2), FUN=foo)
+ })
   user  system elapsed 
390.904   0.808 392.134 &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Combined columns approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; system.time({
+ 	df1$var3 &amp;lt;- apply(df1, 1, paste, collapse=&amp;#39;.&amp;#39;)
+ 	df2$var3 &amp;lt;- apply(df2, 1, paste, collapse=&amp;#39;.&amp;#39;)
+ 	df6 &amp;lt;- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ 	dimnames(df6) &amp;lt;- list(row.names(df1), row.names(df2))
+ })
   user  system elapsed 
  0.421   0.000   0.422 &lt;/code&gt;&lt;/pre&gt;
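
&lt;p&gt;One caveat of the combined columns approach is worth a short sketch: if the separator can occur inside the values, distinct rows may produce the same key. The first data frame below is made up for illustration. The second part rebuilds the small example data frames and assumes that a carriage return never appears in the data; it compares the pasted keys with &lt;code&gt;outer&lt;/code&gt;, which is only one possible variation and was not part of the timings above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Made-up rows that differ but collapse to the same key
a &amp;lt;- data.frame(x=c(&amp;#39;a.b&amp;#39;,&amp;#39;a&amp;#39;), y=c(&amp;#39;c&amp;#39;,&amp;#39;b.c&amp;#39;), stringsAsFactors=FALSE)
key &amp;lt;- do.call(paste, c(a, sep=&amp;#39;.&amp;#39;))
key[1] == key[2] # TRUE, although the rows are not equal

# Same idea with a separator that is assumed to be absent from the data
df1 &amp;lt;- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
df2 &amp;lt;- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
key1 &amp;lt;- do.call(paste, c(df1, sep=&amp;#39;\r&amp;#39;))
key2 &amp;lt;- do.call(paste, c(df2, sep=&amp;#39;\r&amp;#39;))
df7 &amp;lt;- outer(key1, key2, FUN=&amp;#39;==&amp;#39;)
dimnames(df7) &amp;lt;- list(row.names(df1), row.names(df2))&lt;/code&gt;&lt;/pre&gt;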

&lt;p&gt;The combined column approach is by far the fastest, and that makes good sense. It is a bit surprising (at least to me) how much worse the &lt;code&gt;Vectorize&lt;/code&gt; and &lt;code&gt;outer&lt;/code&gt; functions are. Moreover, I am a bit concerned about potential issues with the &lt;code&gt;paste&lt;/code&gt; method and doing comparisons on those results. Please feel free to leave comments below if there are other approaches.&lt;/p&gt; &lt;a href='http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-01-15/Version_1_sqlutils</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-01-15/Version_1_sqlutils.html"/>
    <title>Version 1.0 of sqlutils available on CRAN</title>
    <published>2013-01-15T00:00:00-08:00</published>
    <updated>2013-01-15T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;Version 1.0 of &lt;code&gt;sqlutils&lt;/code&gt; has been released to CRAN. The &lt;code&gt;sqlutils&lt;/code&gt; package is designed to manage a library of SQL files. This package grew out of the needs of an Office of Institutional Research where the vast majority of analysis is conducted on data from our Student Information System (SIS), which is stored in an Oracle database. A lot of our analyses and reports are derived from the same types of datasets and differ only in easily extracted parameters (e.g. date range, program name, status, etc.). We used to store SQL commands in our R scripts, but that can become quite cumbersome and in many ways reduces reusability, which is a major reason for using R in the first place; hence the birth of &lt;code&gt;sqlutils&lt;/code&gt;. For our purposes we currently have over 40 SQL files that have been well vetted and documented. To share the library we simply add the following to our &lt;code&gt;.Rprofile&lt;/code&gt; script:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(sqlutils)
sqlPaths(&amp;#39;/Path/to/shared/directory&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A &lt;a href='/sqlutils'&gt;full introduction to the &lt;code&gt;sqlutils&lt;/code&gt; package is available here&lt;/a&gt; as well as on the &lt;a href='http://github.com/jbryer/sqlutils'&gt;Github project page&lt;/a&gt;. A key advantage to using &lt;code&gt;sqlutils&lt;/code&gt; is that you can store your queries in plain text files (with a &lt;code&gt;.sql&lt;/code&gt; file extension) and document them using &lt;code&gt;roxygen2&lt;/code&gt; style comments. Moreover, R function parameters are used to set parameters within the SQL command. Parameters are defined in SQL files using the colon, parameter name, colon format (i.e. &lt;code&gt;:paramName:&lt;/code&gt;). Using this framework, it is easy to create a &lt;a href='/sqlutils/datadictionary.html'&gt;data dictionary&lt;/a&gt; of the library of SQL files.&lt;/p&gt;
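
&lt;p&gt;To make the placeholder format concrete, here is a minimal base R sketch of how &lt;code&gt;:paramName:&lt;/code&gt; tokens could be filled in. This is not the implementation used by &lt;code&gt;sqlutils&lt;/code&gt;; the query text, the &lt;code&gt;Program&lt;/code&gt; column, and the parameter values are hypothetical. Pasting values into SQL text like this is only appropriate for trusted input.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sql &amp;lt;- &amp;quot;SELECT * FROM students WHERE CreatedDate &amp;gt;= &amp;#39;:startDate:&amp;#39; AND Program = &amp;#39;:program:&amp;#39;&amp;quot;
params &amp;lt;- list(startDate=&amp;#39;2012-01-01&amp;#39;, program=&amp;#39;Nursing&amp;#39;)
for(p in names(params)) {
	# fixed=TRUE treats the placeholder as a literal string, not a regular expression
	sql &amp;lt;- gsub(paste0(&amp;#39;:&amp;#39;, p, &amp;#39;:&amp;#39;), params[[p]], sql, fixed=TRUE)
}
sql&lt;/code&gt;&lt;/pre&gt;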

&lt;p&gt;Lastly, I wrote about an &lt;a href='/posts/2013-01-12/Interactive_SQL_in_R.html'&gt;interactive SQL&lt;/a&gt; mode in R a few days ago. The &lt;code&gt;isql&lt;/code&gt; function is included in the &lt;code&gt;sqlutils&lt;/code&gt; package.&lt;/p&gt; &lt;a href='http://jason.bryer.org/posts/2013-01-15/Version_1_sqlutils.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-01-12/Interactive_SQL_in_R</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-01-12/Interactive_SQL_in_R.html"/>
    <title>Interactive SQL in R</title>
    <published>2013-01-12T00:00:00-08:00</published>
    <updated>2013-01-12T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;I recently taught a very basic introductory SQL workshop and needed a way to have participants interact with SQL statements. Obviously there are lots of tools to interface with a database, but since we are all R users I thought it would be nice to be able to interact without leaving R. Although this interface is fairly basic, the fact that we can type in a SQL statement and get the results as an R data frame provides all the advantages of having data in R. Moreover, I found this to be an interesting exercise in seeing the power of R as a programming language, not just as statistical software. The function described here is part of the &lt;a href='/sqlutils'&gt;&lt;code&gt;sqlutils&lt;/code&gt;&lt;/a&gt; package, which was created to manage a library of SQL files. More information about that is provided on the &lt;a href='/sqlutils'&gt;project page&lt;/a&gt; and I will likely have a forthcoming blog post too.&lt;/p&gt;

&lt;p&gt;First we need to create a database to interact with. In this example we will use the &lt;code&gt;students&lt;/code&gt; data frame from the &lt;a href='/retention'&gt;&lt;code&gt;retention&lt;/code&gt;&lt;/a&gt; package. We will save this data frame into a SQLite database using the RSQLite package. The R code to set up the database is provided as a demo in the package. Type &lt;code&gt;demo(&amp;#39;isql&amp;#39;)&lt;/code&gt; to start.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(sqlutils)
require(RSQLite)
require(retention)
data(students)
students$CreatedDate = as.character(students$CreatedDate)
m &amp;lt;- dbDriver(&amp;quot;SQLite&amp;quot;)
tmpfile &amp;lt;- tempfile(&amp;#39;students.db&amp;#39;, fileext=&amp;#39;.db&amp;#39;)
conn &amp;lt;- dbConnect(m, dbname=tmpfile)
dbWriteTable(conn, &amp;quot;students&amp;quot;, students[!is.na(students$CreatedDate),])&lt;/code&gt;&lt;/pre&gt;
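
&lt;p&gt;If the &lt;code&gt;retention&lt;/code&gt; package is not installed, the same steps can be tried on any data frame. The following sketch assumes only RSQLite, uses an in-memory database, and substitutes the built-in &lt;code&gt;mtcars&lt;/code&gt; data frame for &lt;code&gt;students&lt;/code&gt;; the names &lt;code&gt;conn2&lt;/code&gt; and &lt;code&gt;cars&lt;/code&gt; are arbitrary choices for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(RSQLite)
# In-memory database; mtcars stands in for the students data frame.
conn2 &amp;lt;- dbConnect(SQLite(), dbname=&amp;#39;:memory:&amp;#39;)
dbWriteTable(conn2, &amp;#39;cars&amp;#39;, mtcars)
# Confirm the table exists, then run a query that returns a data frame.
dbListTables(conn2)
res &amp;lt;- dbGetQuery(conn2, &amp;#39;SELECT cyl, count(*) AS n FROM cars GROUP BY cyl ORDER BY cyl&amp;#39;)
res
dbDisconnect(conn2)&lt;/code&gt;&lt;/pre&gt;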

&lt;p&gt;We begin an interactive SQL environment with the &lt;code&gt;isql&lt;/code&gt; function. The only required parameter is &lt;code&gt;conn&lt;/code&gt;, the connection to the database against which SQL statements will be executed. The &lt;code&gt;sql&lt;/code&gt; parameter is optional and sets the initial SQL statement for the session, which can then be edited or executed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; hist &amp;lt;- isql(conn=conn, sql=getSQL(&amp;#39;StudentSummary&amp;#39;))
Interactive SQL mode (type quit to exit, help for available commands)...
SQL&amp;gt;
help
   Command      Description
   ___________  ______________________________________________________
   quit         quit interactive mode
   help         display this message
   sql          enter SQL statement
   edit         edit SQL in a separate text window
   print        print the last entered SQL statement
   exec         execute that last entered SQL statement
   result       prints the last results
   save [name]  save the last executed query to the global environment
SQL&amp;gt;
print
SELECT CreatedDate, count(StudentId) AS count FROM students GROUP BY CreatedDate ORDER BY CreatedDate
SQL&amp;gt;
edit&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src='/images/isql-edit-window.png' alt='SQL Edit Window' /&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SQL&amp;gt;
print
SELECT CreatedDate, count(StudentId) AS count FROM students GROUP BY CreatedDate ORDER BY CreatedDate
SQL&amp;gt;
exec
Executing SQL...
118 rows of 2 variables returned
SQL&amp;gt;
save
Data frame sql.results saved to global environment
SQL&amp;gt;
quit&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;isql&lt;/code&gt; function returns the history of the session invisibly (that is, the results will not be printed but can be assigned to a variable). There are two elements in the returned list: &lt;code&gt;commands&lt;/code&gt; is a character vector listing all the commands entered, and &lt;code&gt;sql&lt;/code&gt; is a character vector containing all the SQL statements entered.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; names(hist)
[1] &amp;quot;sql&amp;quot;      &amp;quot;commands&amp;quot;&lt;/code&gt;&lt;/pre&gt; &lt;a href='http://jason.bryer.org/posts/2013-01-12/Interactive_SQL_in_R.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
  
  <entry>
    <id>http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R</id>
    <link type="text/html" rel="alternate" href="http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html"/>
    <title>Reading Codebook Files in R</title>
    <published>2013-01-10T00:00:00-08:00</published>
    <updated>2013-01-10T00:00:00-08:00</updated>
    <author>
      <name>Jason Bryer</name>
      <uri>http://jason.bryer.org/</uri>
    </author>
    <content type="html">&lt;p&gt;One issue I continuously encounter when starting to work with a new dataset is that of the codebook. In general, I prefer to load a codebook into R like any other data source, specifically as a data frame. Ideally, one data frame provides the variable names with descriptions and any other metadata available, and a separate list of named vectors can be used to recode factors. Although there is no standard format for codebooks, most follow a similar format. This post outlines the &lt;a href='https://gist.github.com/4497585'&gt;&lt;code&gt;parse.codebook&lt;/code&gt;&lt;/a&gt; function that will read codebooks that have the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each line in the file provides information about a variable (which I refer to as a variable row), or the mapping of a factor level (which I refer to as a level row).&lt;/li&gt;

&lt;li&gt;Variable rows start on the left edge (that is, there is a non-whitespace character at position 1 of the row).&lt;/li&gt;

&lt;li&gt;Level rows do not start on the left edge (that is, there is a whitespace character at position 1 of the row, for example a tab or space).&lt;/li&gt;

&lt;li&gt;Rows are either fixed width (see &lt;code&gt;?read.fwf&lt;/code&gt; for more information as to specifics) or character delimited (e.g. comma, colon, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although not all codebooks strictly adhere to these rules, it is often trivial, even if a bit tedious, to reformat the file so that it does. Also, blank lines are permissible and will simply be ignored.&lt;/p&gt;
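
&lt;p&gt;The rules above can be sketched in a few lines of base R. This is only an illustration of how variable rows and level rows can be told apart by their first character; it is not taken from &lt;code&gt;parse.codebook&lt;/code&gt;, and the sample lines are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: classify codebook lines by their first character.
lines &amp;lt;- c(&amp;#39;FIPST  3  AN  State code&amp;#39;, &amp;#39;&amp;#39;, &amp;#39;\t01 = Alabama&amp;#39;, &amp;#39;  02 = Alaska&amp;#39;)
# Blank lines are dropped: keep lines with any non-whitespace character.
lines &amp;lt;- lines[grepl(&amp;#39;[^[:space:]]&amp;#39;, lines)]
# A level row starts with a space or a tab; anything else is a variable row.
is.level &amp;lt;- substr(lines, 1, 1) %in% c(&amp;#39; &amp;#39;, &amp;#39;\t&amp;#39;)
data.frame(line=lines, type=ifelse(is.level, &amp;#39;level&amp;#39;, &amp;#39;variable&amp;#39;))&lt;/code&gt;&lt;/pre&gt;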

&lt;p&gt;If the codebook file adheres to these rules, the &lt;code&gt;parse.codebook&lt;/code&gt; function will parse the file and return an object of type &lt;code&gt;codebook&lt;/code&gt; that inherits from &lt;code&gt;data.frame&lt;/code&gt;, so all the data frame functions are valid (e.g. &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;nrow&lt;/code&gt;, &lt;code&gt;names&lt;/code&gt;, etc.). This data frame contains all the information about the variables taken from the variable rows. Information about factor levels is stored in a &lt;code&gt;list&lt;/code&gt; as an &lt;code&gt;attribute&lt;/code&gt; of the returned object, which can be retrieved using &lt;code&gt;attr(mycodebook, &amp;#39;levels&amp;#39;)&lt;/code&gt;. Examples from the &lt;a href='http://nces.ed.gov/ccd/'&gt;Common Core of Data&lt;/a&gt; and the &lt;a href='http://www.census.gov/acs/www/'&gt;American Community Survey&lt;/a&gt; are provided below.&lt;/p&gt;

&lt;h4 id='installation'&gt;Installation&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;parse.codebook&lt;/code&gt; function is currently provided on &lt;a href='https://gist.github.com/4497585'&gt;Gist&lt;/a&gt;. You can either download the R script file or source it directly from Gist using the &lt;code&gt;devtools&lt;/code&gt; package.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require(devtools)
source_gist(4497585)&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id='parameters'&gt;Parameters&lt;/h5&gt;

&lt;p&gt;The &lt;code&gt;parse.codebook&lt;/code&gt; function has a number of parameters to indicate the format of variable and level rows. The function will handle both character delimited rows and fixed width rows. Therefore, either &lt;code&gt;var.sep&lt;/code&gt; or &lt;code&gt;var.widths&lt;/code&gt; must be specified, as well as either &lt;code&gt;level.sep&lt;/code&gt; or &lt;code&gt;level.widths&lt;/code&gt;. The available parameters are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;file&lt;/code&gt; codebook file name.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;var.names&lt;/code&gt; the name of the columns for variable rows.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;level.names&lt;/code&gt; the name of the columns for level rows.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;var.sep&lt;/code&gt; the separator for variable rows.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;level.sep&lt;/code&gt; the separator for level rows.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;level.indent&lt;/code&gt; character vector providing character(s) at the beginning of the line that indicate the line represents a factor level. Each element should have 1 character as only the first character of the line is compared.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;var.name&lt;/code&gt; the name in &lt;code&gt;var.names&lt;/code&gt; that represents the variable name. This should be a valid R variable name as this will be the column name in the corresponding data file, as well as the name used in the &lt;code&gt;list&lt;/code&gt; of levels stored as an attribute to the returned object.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id='example_one_common_core_of_data'&gt;Example One: Common Core of Data&lt;/h4&gt;

&lt;p&gt;The &lt;a href='http://nces.ed.gov/ccd/'&gt;Common Core of Data&lt;/a&gt; (CCD) is a dataset provided by the &lt;a href='http://nces.ed.gov/'&gt;National Center for Education Statistics&lt;/a&gt; that provides information about K-12 schools in the United States. The codebook provided is in plain text and required two modifications: one, general file information at the top of the file was deleted, and two, any descriptions that spanned lines were modified so they are on only one line. Here are the first 15 lines of the modified file; the full file can be downloaded &lt;a href='http://jason.bryer.org/codebooks/ccdCodebook.txt'&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SURVYEAR      1      AN     Year corresponding to survey record.

NCESSCH       2      AN     Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).    

FIPST         3      AN     American National Standards Institute (ANSI) state code..

                             01  =  Alabama        
                             02  =  Alaska          
                             04  =  Arizona
                             05  =  Arkansas       
                             06  =  California      
                             08  =  Colorado
                             09  =  Connecticut    
                             10  =  Delaware        
                             11  =  District of Columbia&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This codebook uses fixed widths for variable rows, and separators (using the equal sign) for level rows (although it is also possible to use fixed widths for level rows). First, we will parse the file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ccd.codebook &amp;lt;- parse.codebook(&amp;#39;ccdCodebook.txt&amp;#39;, 
				var.names=c(&amp;#39;variable&amp;#39;,&amp;#39;order&amp;#39;,&amp;#39;type&amp;#39;,&amp;#39;description&amp;#39;),
				level.names=c(&amp;#39;level&amp;#39;,&amp;#39;label&amp;#39;),
				level.sep=&amp;#39;=&amp;#39;, 
				var.widths=c(13, 7, 7, Inf) )&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here are the first six rows of the returned data frame.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; head(ccd.codebook)
  linenum variable order type                                                                                    description isfactor
1       1 SURVYEAR     1   AN                                                           Year corresponding to survey record.    FALSE
2       3  NCESSCH     2   AN Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).    FALSE
3       5    FIPST     3   AN                                      American National Standards Institute (ANSI) state code..     TRUE
4      67    LEAID     4   AN                                                          NCES local education agency (LEA) ID.    FALSE
5      69    SCHNO     5   AN                                                                                NCES school ID.    FALSE
6      71     STID     6   AN                                                       State?s own ID for the education agency.    FALSE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In addition to the columns corresponding to &lt;code&gt;var.names&lt;/code&gt;, the function also returns a &lt;code&gt;linenum&lt;/code&gt; and &lt;code&gt;isfactor&lt;/code&gt; column. The former is an integer corresponding to the line number in the original file from which this row was parsed. This is useful for tracking down issues in the parsing or text formatting. The &lt;code&gt;isfactor&lt;/code&gt; is a logical column indicating whether there are factor levels specified for that variable. Factor levels can be retrieved as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; ccd.var.levels &amp;lt;- attr(ccd.codebook, &amp;#39;levels&amp;#39;)
&amp;gt; names(ccd.var.levels)
[1] &amp;quot;FIPST&amp;quot;  &amp;quot;TYPE&amp;quot;   &amp;quot;STATUS&amp;quot; &amp;quot;TITLEI&amp;quot; &amp;quot;STITLI&amp;quot; &amp;quot;MAGNET&amp;quot; &amp;quot;CHARTR&amp;quot; &amp;quot;SHARED&amp;quot;
&amp;gt; ccd.var.levels[[&amp;#39;TYPE&amp;#39;]]
  linenum level                    label
1     103     1           Regular school
2     105     2 Special education school
3     107     3        Vocational school
4     109     4 Other/alternative school
5     111     5       Reportable program&lt;/code&gt;&lt;/pre&gt;
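
&lt;p&gt;The fixed width split used for the variable rows in this example can be sketched in base R. This is an illustration of the idea only and is not taken from &lt;code&gt;parse.codebook&lt;/code&gt;; in particular, treating &lt;code&gt;Inf&lt;/code&gt; as &amp;quot;the rest of the line&amp;quot; is an assumption about what the last width means.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: split one variable row at widths 13, 7, 7 and the remainder.
row &amp;lt;- &amp;#39;SURVYEAR      1      AN     Year corresponding to survey record.&amp;#39;
widths &amp;lt;- c(13, 7, 7, Inf)
# End position of each field, capped at the length of the line (handles Inf).
ends &amp;lt;- pmin(cumsum(widths), nchar(row))
starts &amp;lt;- c(1, head(ends, -1) + 1)
# Cut the fields and strip the padding around each one.
fields &amp;lt;- gsub(&amp;#39;^[[:space:]]+|[[:space:]]+$&amp;#39;, &amp;#39;&amp;#39;, substring(row, starts, ends))
names(fields) &amp;lt;- c(&amp;#39;variable&amp;#39;, &amp;#39;order&amp;#39;, &amp;#39;type&amp;#39;, &amp;#39;description&amp;#39;)
fields&lt;/code&gt;&lt;/pre&gt;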

&lt;h4 id='example_two_american_community_survey'&gt;Example Two: American Community Survey&lt;/h4&gt;

&lt;p&gt;The &lt;a href='http://www.census.gov/acs/www/'&gt;American Community Survey&lt;/a&gt; is the current version of the Census Long Form. The codebook provided by the United States Census Bureau is in PDF format, but is easily converted to a plain text file. This file required more modification than the CCD file described above, mostly removing line numbers that pasted over from the PDF as well as ensuring that descriptions did not span lines. The final modified version can be downloaded &lt;a href='http://jason.bryer.org/codebook/acsPersonCodebook.txt'&gt;here&lt;/a&gt;. Here are the first 10 lines of the file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SPORDER .Person number
ST .State Code
	01 .Alabama/AL
	02 .Alaska/AK
	04 .Arizona/AZ
	05 .Arkansas/AR
	06 .California/CA
	08 .Colorado/CO
	09 .Connecticut/CT
	10 .Delaware/DE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this codebook file, all rows are character delimited on &lt;code&gt; .&lt;/code&gt; (space period). We parse the file as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;acs.codebook &amp;lt;- parse.codebook(&amp;#39;acsPersonCodebook.txt&amp;#39;, 
				   var.names=c(&amp;#39;var&amp;#39;,&amp;#39;desc&amp;#39;), 
				   level.names=c(&amp;#39;level&amp;#39;,&amp;#39;label&amp;#39;),
				   var.sep=&amp;#39; .&amp;#39;, level.sep=&amp;#39; .&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first six lines of the returned data frame are:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; head(acs.codebook)
      var                                                                                desc linenum isfactor
1 SPORDER                                                                       Person number       1    FALSE
2      ST                                                                          State Code       2     TRUE
3  ADJINC Adjustment factor for income and earnings dollar amounts (6 implied decimal places)      55    FALSE
4   PWGTP                                                                     Person&amp;#39;s weight      56    FALSE
5    AGEP                                                                                 Age      57    FALSE
6     CIT                                                                  Citizenship status      58     TRUE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And factor levels:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; var.levels &amp;lt;- attr(acs.codebook, &amp;#39;levels&amp;#39;)
&amp;gt; names(var.levels)
 [1] &amp;quot;ST&amp;quot;      &amp;quot;CIT&amp;quot;     &amp;quot;COW&amp;quot;     &amp;quot;DRAT&amp;quot;    &amp;quot;ENG&amp;quot;     &amp;quot;GCM&amp;quot;     &amp;quot;JWRIP&amp;quot;   &amp;quot;JWTR&amp;quot;    &amp;quot;MAR&amp;quot;     &amp;quot;MARHM&amp;quot;  
[11] &amp;quot;MARHT&amp;quot;   &amp;quot;MARHW&amp;quot;   &amp;quot;MIG&amp;quot;     &amp;quot;MIL&amp;quot;     &amp;quot;NWAV&amp;quot;    &amp;quot;RELP&amp;quot;    &amp;quot;SCH&amp;quot;     &amp;quot;SCHG&amp;quot;    &amp;quot;SCHL&amp;quot;    &amp;quot;SEX&amp;quot;    
[21] &amp;quot;WKL&amp;quot;     &amp;quot;WKW&amp;quot;     &amp;quot;WRK&amp;quot;     &amp;quot;ANC&amp;quot;     &amp;quot;ANC1P&amp;quot;   &amp;quot;ANC2P&amp;quot;   &amp;quot;DECADE&amp;quot;  &amp;quot;DIS&amp;quot;     &amp;quot;DRIVESP&amp;quot; &amp;quot;ESP&amp;quot;    
[31] &amp;quot;ESR&amp;quot;     &amp;quot;FOD1P&amp;quot;   &amp;quot;6402&amp;quot;    &amp;quot;FOD2P&amp;quot;   &amp;quot;HICOV&amp;quot;   &amp;quot;HISP&amp;quot;    &amp;quot;INDP&amp;quot;    &amp;quot;JWAP&amp;quot;    &amp;quot;JWDP&amp;quot;    &amp;quot;LANP&amp;quot;   
[41] &amp;quot;MIGSP&amp;quot;   &amp;quot;MSP&amp;quot;     &amp;quot;NAICSP&amp;quot;  &amp;quot;NOP&amp;quot;     &amp;quot;OCCP02&amp;quot;  &amp;quot;OCCP10&amp;quot;  &amp;quot;PAOC&amp;quot;    &amp;quot;POBP&amp;quot;    &amp;quot;POWSP&amp;quot;   &amp;quot;PRIVCOV&amp;quot;
[51] &amp;quot;PUBCOV&amp;quot;  &amp;quot;QTRBIR&amp;quot;  &amp;quot;RAC1P&amp;quot;   &amp;quot;RAC2P&amp;quot;   &amp;quot;RAC3P&amp;quot;   &amp;quot;SFN&amp;quot;     &amp;quot;SFR&amp;quot;     &amp;quot;SOCP00&amp;quot;  &amp;quot;SOCP10&amp;quot;  &amp;quot;VPS&amp;quot;    
[61] &amp;quot;WAOB&amp;quot;    &amp;quot;FHINS3C&amp;quot; &amp;quot;FHINS4C&amp;quot; &amp;quot;FHINS5C&amp;quot;
&amp;gt; var.levels[[&amp;#39;CIT&amp;#39;]]
  linenum level                                                                        label
1      59     1                                                             Born in the U.S.
2      60     2 Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas
3      61     3                                            Born abroad of American parent(s)
4      62     4                                               U.S. citizen by naturalization
5      63     5                                                    Not a citizen of the U.S.&lt;/code&gt;&lt;/pre&gt;
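
&lt;p&gt;A levels table like this one can be used to recode a numeric variable into a labeled factor. The sketch below rebuilds the &lt;code&gt;CIT&lt;/code&gt; table by hand so that it runs on its own; the &lt;code&gt;cit&lt;/code&gt; data values are made up, and it assumes the &lt;code&gt;level&lt;/code&gt; column holds the codes without surrounding whitespace.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: a levels table shaped like var.levels[[&amp;#39;CIT&amp;#39;]] above.
cit.levels &amp;lt;- data.frame(
	level=c(&amp;#39;1&amp;#39;, &amp;#39;2&amp;#39;, &amp;#39;3&amp;#39;, &amp;#39;4&amp;#39;, &amp;#39;5&amp;#39;),
	label=c(&amp;#39;Born in the U.S.&amp;#39;,
			&amp;#39;Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas&amp;#39;,
			&amp;#39;Born abroad of American parent(s)&amp;#39;,
			&amp;#39;U.S. citizen by naturalization&amp;#39;,
			&amp;#39;Not a citizen of the U.S.&amp;#39;),
	stringsAsFactors=FALSE)
# Made-up data values standing in for a CIT column in the survey data.
cit &amp;lt;- c(1, 5, 4, 1)
# factor() matches the codes against level and attaches the longer labels.
cit.f &amp;lt;- factor(cit, levels=cit.levels$level, labels=cit.levels$label)
table(cit.f)&lt;/code&gt;&lt;/pre&gt;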

&lt;h4 id='conclusion'&gt;Conclusion&lt;/h4&gt;

&lt;p&gt;Although a standard codebook format doesn&amp;#8217;t exist, most adopt a similar format. I have outlined the &lt;code&gt;parse.codebook&lt;/code&gt; function that, with minimal reformatting of the original codebook file, can be used to read a codebook into R. This is tremendously useful as we can now merge in variable descriptions when creating tables and figures, as well as recode factors with their longer descriptions in an automated fashion.&lt;/p&gt; &lt;a href='http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html'&gt;Read full post...&lt;/a&gt;</content>
  </entry>
   
</feed>
