This post is obsolete; its content is not correct. Please see Gabor's comment.
In the previous post, Manipulate data in R via SQL interface, I introduced how to use RpgSQL as the DB driver to interface sqldf with PostgreSQL.
But there are some problems when using RpgSQL.
> options(sqldf.driver = "pgSQL")
> sqldf("select Species from iris where Species like 'set%' limit 3")
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for select Species from iris where Species like 'set%' limit 3 (ERROR: column "species" does not exist
RPostgreSQL, on the other hand, works fine:
> options(sqldf.driver = "PostgreSQL")
> sqldf("select Species from iris limit 3")
> sqldf("select Species from iris where Species like 'set%' limit 3")
If you have trouble installing RPostgreSQL under Mac OSX, please refer to Install R Package RPostgreSQL in Mac OSX.
I have been looking for an R package that provides a SQL-like interface for data manipulation. Though there are already usable DBI/RDBI packages for talking to SQL-interfacing databases, they are not convenient tools for manipulating data outside the database, such as in-memory data frames.
After several rounds of googling, I finally found the sqldf package, a seamless interface between R's data frame objects and the SQL flavors provided by specific database back-ends, like SQLite (via RSQLite) and PostgreSQL (via RpgSQL).
Personally, I prefer PostgreSQL to SQLite because of its SQL language features, so I tried my best to make sqldf work with it on my Mac.
Looking at the code of the sqldf package, it tries to use 'RpgSQL', 'RMySQL', 'RH2' and 'RSQLite' in that order, so if you have RpgSQL installed, sqldf will by default connect to the 'test' database of your local PostgreSQL instance.
Use PostgreSQL as Database Back-end
sqldf can detect the loaded RpgSQL package and use it. If you have no other DBI package loaded, sqldf will use RSQLite as a fallback.
Configure RpgSQL on Mac
Configuring RpgSQL may not be necessary on other platforms, but on my Mac OSX 10.5 box it is.
1. Make sure the postgresql-jdbc port is installed, or run:
sudo port install postgresql-jdbc
2. Change jar file name
Without changing the jar file name, RpgSQL is unable to locate the jar file automatically.
sudo ln -s postgresql.jar postgresql.jdbc4.jar
3. Set R options
Tell R where to find the jar files.
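The actual option setting did not survive in this post. As a sketch, assuming the MacPorts jar location and an RpgSQL.* option name following the same pattern as RpgSQL.dbname (verify both against the RpgSQL documentation), it would look something like:

```r
# Assumed option name and MacPorts jar path -- check ?RpgSQL for the exact names
options(RpgSQL.jar = "/opt/local/share/java/postgresql.jdbc4.jar")
```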
4. Create a 'test' database in your local PostgreSQL, or set the R option RpgSQL.dbname, or provide the dbname parameter on each call
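Assuming the local server is already running on the default port, the 'test' database can be created from the shell with the standard client tool:

```shell
# createdb ships with PostgreSQL; this needs a running local server
createdb test
```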
5. Everything is ready; run this to test:
sqldf('select * from iris limit 5')
Change sqldf driver
options(sqldf.driver = "")
I tested the following drivers with a simple select * from iris limit 5 query.
- RpgSQL – there is something wrong with RpgSQL when the SQL contains factor fields; please see my next post, Play sqldf with RPostgreSQL.
1. To avoid loading tcl/tk
options(gsubfn.engine = "R")
Recently, I have been developing a website and need a practical solution for setting up a real-life production web serving environment.
I googled and found the following links useful for my final decision.
Other resources can also be found via the links in the above articles.
I haven't yet decided which app server and cache to use, but nginx (uwsgi) + a gevent-powered app server + StaticGenerator, Django's built-in cache middleware, or Varnish would be the most likely choices.
I installed the PostgreSQL 9.0 port on my puny Mac Pro, and I don't want it to be a burden on the limited 2GB of memory, so I modified the following parameters in the postgresql.conf file:
shared_buffers = 2MB
max_connections = 10
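Note that shared_buffers only takes effect after a full server restart (a reload is not enough). With the MacPorts layout that might look like the following; the data directory path is an assumption, so adjust it to your own setup:

```shell
# restart the server so the new shared_buffers setting takes effect
sudo su postgres -c '/opt/local/lib/postgresql90/bin/pg_ctl -D /opt/local/var/db/postgresql90/defaultdb restart'
```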
Actually, I don't need many concurrent connections in a development environment.
After the tuning, the virtual memory of each postgres process dropped to 31MB, much smaller than the original 57MB.
There are at least four different ways to install PostgreSQL in your Mac.
Method 1: Install from source code
This is the most tedious way to install PostgreSQL: you have to deal with resolving dependencies and compiling everything yourself.
Method 2: Install via binary package provided by EnterpriseDB
This is the quickest way to get a PostgreSQL instance ready to serve, but it has constraints. The binary package is self-contained, bundling all dependencies and tools, so it will not coexist with packages installed via MacPorts or Fink on your system. If you want to install client packages like py26-psycopg2, you will have to install another copy of PostgreSQL via MacPorts or Fink as well, just to satisfy their dependencies.
Besides, if you want to enable PL/Python, there is no convenient way to do this.
Method 3: Install via Fink
Fink will resolve all the dependency problems for you and install PostgreSQL under its prefix.
Method 4: Install via MacPorts
port install postgresql90-server
And follow the instructions displayed on the screen.
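For reference, the on-screen notes for the MacPorts port typically boil down to initializing a database cluster along these lines; the paths below are the port's defaults and may differ on your box:

```shell
# create and initialize the default data directory (assumed MacPorts paths)
sudo mkdir -p /opt/local/var/db/postgresql90/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql90/defaultdb
sudo su postgres -c '/opt/local/lib/postgresql90/bin/initdb -D /opt/local/var/db/postgresql90/defaultdb'
```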
As there are several versions of PostgreSQL in MacPorts, you need to decide which one to install; I prefer the new features of pg90. After installation, the PostgreSQL binary directory is not on the standard paths like /opt/local/bin, so I put it on my PATH. Otherwise you will not have tools like pg_dump at hand.
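For example, appending a line like the following to ~/.profile does the trick; the directory assumes the MacPorts pg90 layout:

```shell
# make pg_dump, psql, etc. reachable from the shell
export PATH=/opt/local/lib/postgresql90/bin:$PATH
```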
I am trying to use PostgreSQL for warehousing data and R for interactive data exploration. So the first step to make this happen is to connect to PostgreSQL from R and read data back via customized SQL.
As discussed on Stack Overflow, RPostgreSQL seems to be the best choice for its maturity and standards compliance. But there is no official binary for Mac OSX, so I had to make it work myself.
First of all, my PostgreSQL is installed via MacPorts; I am using the 8.4 version. And I am using the binary version of R downloaded from its homepage: I prefer the binary version over the MacPorts one because it has a native GUI and doesn't depend on X11. The trouble with installing RPostgreSQL comes from PostgreSQL and R being installed from different sources.
I downloaded the RPostgreSQL source code from the project page and issued:
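The install command itself was lost from this post; presumably it was the standard source install, something like the following (the version placeholder stands for whichever tarball you downloaded):

```shell
# standard source install of a downloaded R package tarball
R CMD INSTALL RPostgreSQL_<version>.tar.gz
```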
The configure script fails to find the libpq-fe.h header. That is because my PostgreSQL is installed under the /opt/local/ prefix, which is not among the default locations RPostgreSQL's configure script searches for include and lib files.
The solution is simple. After reading RPostgreSQL's configure script, I found that it can use pg_config to discover the PostgreSQL installation automatically. But there is still a problem: pg_config resides at /opt/local/lib/postgresql84/bin, which is not in the standard executable search paths. So I made a temporary link to pg_config, like
sudo ln -s /opt/local/lib/postgresql84/bin/pg_config /usr/local/bin/
Then re-issuing the above install command succeeded.
- You need to install the RPostgreSQL package from source due to the lack of a precompiled binary for Mac OSX.
- If your R and PostgreSQL are installed from different sources (like mine: one from a binary, the other via MacPorts), or R complains that libpq-fe.h cannot be found, check whether pg_config is in your executable search paths.
- If pg_config cannot be found, fix it by making a soft link (my method) or by adding the path of pg_config to the PATH environment variable.
- Issue the ordinary install command in R environment.
Finally, to test whether the installation succeeded, type:
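The test command itself did not survive here. Presumably it was just loading the package and opening a connection, along these lines (the dbname is an assumption; these are the standard DBI calls RPostgreSQL implements):

```r
library(RPostgreSQL)                    # loads cleanly if the build succeeded
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "test")  # connect to a local database
dbGetQuery(con, "select version()")     # read data back via SQL
dbDisconnect(con)
```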
Inspired by Dan Sanduleac’s post at stackless.com, I re-ran the benchmark between Cython and Boost.Python.
Basically, Cython is about 7x faster than Boost.Python, which astonished me.
The benchmark was run on my 13-inch MacBook Pro with OSX Leopard. The software I used was all installed via MacPorts:
- boost: 1.42.0 (with +python26 enabled)
- Cython: 0.12
- Python 2.6
- cmake 2.8
I used cmake instead of boost-bjam, which is the recommended build tool for Boost.Python; I don't think that affects the result. When compiling, I had -O3 turned on.
As I said, the benchmark was initially carried out by Dan Sanduleac, but I thought his implementation had a problem: he used Cython directly to do the arithmetic operation, rather than native C++. For a fair comparison, we should use Cython and Boost.Python in the same way, binding C++ code that implements the calculation logic, so that we measure the overhead of the binding code much more precisely.