Home > PostgreSQL, R, Statistics > Manipulate data in R via SQL interface

Manipulate data in R via SQL interface

I has been looking for a R package providing SQL-like interface for data manipulation. Though, there are already usable DBI/RDBI package we can use to talk to SQL interfacing database already, they are still not the convenient tools for manipulate data outside the database, like in-memory frames.

After several rounds of googling, I finally find sqldf package, an seamless interface between R’s frame objects and SQL flavors, provided by specific database back-ends, like SQLite (via RSQLite) and PostgreSQL (via RpgSQL).

Personally, I prefer PostgreSQL over RSQLite, because of the SQL language features. So I tried my best to make sqldf work on my Mac.

Installation


install.packages('RpgSQL')
install.packages('sqldf')

By looking at the code of sqldf package, it tries to use ‘RpgSQL’, ‘RMySQL’, ‘RH2′ and ‘RSQLite’ in order, so if you have RpgSQL installed, it will connect the ‘test’ database of your local postgreSQL instance by default.

Use PostgreSQL as Database Back-end


library(RpgSQL)

sqldf can detect the loaded RpgSQL packge and use it. If you have no other DBI package loaded, sqldf will use RSQLite as a fallback.

Configure RpgSQL on Mac

Configuration to make RpgSQL work might be not needed for other platforms, but in my Mac box with Mac OSX 10.5 it is needed.

1. Make sure postgresql-jdbc port is installed, or

sudo port instal postgresql-jdbc

2. Change jar file name

Without changing the jar file name, RpgSQL is unable to locate the jar file automatically.


cd /opt/local/share/java/
sudo ln -s postgresql.jar postgresql.jdbc4.jar

3. Set R options

Tell R where to find the jar files.


options(RpgSQL_JAR='/opt/local/share/java/')

4. Create ‘test’ database in your local PostgreSQL, or you can set R option RpgSQL.dbname, or provide dbname paramter each time calling sqldf

5. Everything is ready, run to test

sqldf('select * from iris limit 5')

Change sqldf driver


options(sqldf.driver = "")

I tested following drivers with simple select * from iris limit 5 SQL.

Other options

1. To avoid loading tcl/tk

options(gsubfn.engine = "R")

Advertisement
Categories: PostgreSQL, R, Statistics Tags: , , , ,
  1. Gabor Grothendieck
    October 2, 2010 at 11:58 AM | #1

    Thanks for pointing these items out. Version 0.1-3.2 of the RpgSQL driver
    http://cran.r-project.org/web/packages/RpgSQL/index.html
    now has the following features which should mean your points 2 and 3 are no longer needed:

    (1) looks into /opt/local/share/java/ (among other places) and

    (2) looks for postgresql*.jar if “postgresql*.jdbc4.jar” not found

    (3) Also the option that was RpgSQL_JAR is now RpgSQL.JAR to be consistent with RpgSQL.dbname and RpgSQL.password; however, the environment variable continues to be RpgSQL_JAR.

    (4) Its now possible to specify either (i) the complete path and filename in the RpgSQL.JAR option or in the RpgSQL_JAR environment variable or (ii) one can specify just the pathname. If one specifies just the pathname it first looks for jar files with name of the form postgresql*.jdbc4.jar and if not found it looks for jar files with name of the form postgresql*.jar .

  1. September 28, 2010 at 4:51 AM | #1

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.