Manipulate data in R via SQL interface
I has been looking for a R package providing SQL-like interface for data manipulation. Though, there are already usable DBI/RDBI package we can use to talk to SQL interfacing database already, they are still not the convenient tools for manipulate data outside the database, like in-memory frames.
After several rounds of googling, I finally find sqldf package, an seamless interface between R’s frame objects and SQL flavors, provided by specific database back-ends, like SQLite (via RSQLite) and PostgreSQL (via RpgSQL).
Personally, I prefer PostgreSQL over RSQLite, because of the SQL language features. So I tried my best to make sqldf work on my Mac.
Installation
install.packages('RpgSQL')
install.packages('sqldf')
By looking at the code of sqldf package, it tries to use ‘RpgSQL’, ‘RMySQL’, ‘RH2′ and ‘RSQLite’ in order, so if you have RpgSQL installed, it will connect the ‘test’ database of your local postgreSQL instance by default.
Use PostgreSQL as Database Back-end
library(RpgSQL)
sqldf can detect the loaded RpgSQL packge and use it. If you have no other DBI package loaded, sqldf will use RSQLite as a fallback.
Configure RpgSQL on Mac
Configuration to make RpgSQL work might be not needed for other platforms, but in my Mac box with Mac OSX 10.5 it is needed.
1. Make sure postgresql-jdbc port is installed, or
sudo port instal postgresql-jdbc
2. Change jar file name
Without changing the jar file name, RpgSQL is unable to locate the jar file automatically.
cd /opt/local/share/java/
sudo ln -s postgresql.jar postgresql.jdbc4.jar
3. Set R options
Tell R where to find the jar files.
options(RpgSQL_JAR='/opt/local/share/java/')
4. Create ‘test’ database in your local PostgreSQL, or you can set R option RpgSQL.dbname, or provide dbname paramter each time calling sqldf
5. Everything is ready, run to test
sqldf('select * from iris limit 5')
Change sqldf driver
options(sqldf.driver = "")
I tested following drivers with simple select * from iris limit 5 SQL.
- RpgSQL – there is something wrong with RpgSQL when the SQL contains factor fields, please see my next post Play sqldf with RPostgreSQL.
- RPostgreSQL
- RSQLite
Other options
1. To avoid loading tcl/tk
options(gsubfn.engine = "R")
Thanks for pointing these items out. Version 0.1-3.2 of the RpgSQL driver
http://cran.r-project.org/web/packages/RpgSQL/index.html
now has the following features which should mean your points 2 and 3 are no longer needed:
(1) looks into /opt/local/share/java/ (among other places) and
(2) looks for postgresql*.jar if “postgresql*.jdbc4.jar” not found
(3) Also the option that was RpgSQL_JAR is now RpgSQL.JAR to be consistent with RpgSQL.dbname and RpgSQL.password; however, the environment variable continues to be RpgSQL_JAR.
(4) Its now possible to specify either (i) the complete path and filename in the RpgSQL.JAR option or in the RpgSQL_JAR environment variable or (ii) one can specify just the pathname. If one specifies just the pathname it first looks for jar files with name of the form postgresql*.jdbc4.jar and if not found it looks for jar files with name of the form postgresql*.jar .