Using R in Postgres for Logistic Regression Modeling
Greetings friends! We have finally come to the point in the Postgres for Data Science series where we are not doing data preparation. Today we are going to do modeling and prediction of fire occurrence given weather parameters… IN OUR DATABASE!
Quick recap:
- We found some data on historical fires and historical weather data for California.
- We fixed up the data and merged it to make it more appropriate for logistic regression.
- We randomly sampled non-fire data and split the data into training and verification data sets.
- We put our final data for analysis into a PostgreSQL schema named “final”.
- Finally, we standardized and centered our weather parameters in new columns in our table.
For those of you not in statistics or data science, this may seem like a lot of work just to run a logistic regression. But as people in those fields will tell you, cleaning, shaping, and transforming your data for analysis is actually where you spend the majority of your time.
Statistical modeling with PL/R
For those who have been living under a rock and haven’t heard of R, it is an open-source programming language focused on statistics and data science with a HUGE ecosystem behind it. While its syntax may look strange at first, it is extremely well suited to matrix calculations and statistical analysis.
As we have seen before, PostgreSQL has the ability to write functions in different programming languages. My colleague at Crunchy Data, Joe Conway, did just that for R with the embedded language PL/R. I don’t have time or space to teach PL/R in a blog post, but there are a lot of good materials, including a hands-on tutorial and presentations. Today I will mostly show the workflow a data scientist can use and some of the more interesting properties of PL/R.
Getting data from PostgreSQL in and out of R
One of the best properties Joe built into the language is RPostgreSQL compatibility support. If you work in R, the driver you use to connect to PostgreSQL is RPostgreSQL. What Joe did was stub out all the RPostgreSQL connection methods in PL/R as no-op functions; instead, PL/R assumes it is connecting to the current database. As the documentation says:
These functions nominally work like their RPostgreSQL counterparts except that all queries are performed in the current database. Therefore all driver and connection related parameters are ignored, and dbDriver, dbConnect, dbDisconnect, and dbUnloadDriver are no-ops.
The implication of this for you, dear reader, is that you can write your normal RPostgreSQL R code on your desktop and then drop it almost straight into a PL/R function. Brilliant!
This pattern is exactly what I used for my function development. I used my favorite (and best) IDE in the whole world, IDEA, with its R plugin to do all my R work on my desktop. I just used RPostgreSQL to connect to the DB where I would be creating my function.
Creating the logistic regression model
Since my main purpose was to show the use of PostgreSQL in a data science/statistical analysis workflow, I did not go through the normal model fitting techniques. Instead, I picked independent variables that seem plausibly related to fire risk:
- Minimum relative humidity
- Maximum air temperature
- Precipitation
- Solar radiation
The parameters and overall model were all significant at p < 0.05 so I deemed it “good enough”.
Onward and upward!
Putting the model into PL/R
The next phase is to take the model off my desktop and put it on the Postgres server. Given the work Joe did on RPostgreSQL compatibility, I made zero changes to my R code when creating the PL/R function. Let’s go through the steps I took to create the function.
The first step was to add the PL/R extension to my database. I need to do this command as a superuser (typically the user “postgres”). I will also need to add the function as a superuser since PL/R is an untrusted language. Note, this tutorial actually teaches you how to make PL/R a trusted language, but please don’t do that on your production DBs unless you know what you are doing.
```sql
CREATE EXTENSION plr;
```
And with that we are ready to add our function.
Turn my R code into a Postgres function
The code for the function follows the standard template for a PostgreSQL function (if you are not familiar with functions in PostgreSQL, here is a nice tutorial). Here is the complete code for my embedded function. In this blog post I am going to break it down by topical area.
Function Signature
The function signature is a typical Postgres function signature.
```sql
CREATE OR REPLACE FUNCTION final.logistic_model()
RETURNS bytea AS $$
```
- We create this function in the `final` schema, which is also where the data is located.
- We are not passing in any data because we will get the table data through the R code.
- We are going to return `bytea` because we are actually going to store the logistic model created in R in another table.
R PostgreSQL Connection and Data Code
To begin our R function we need to get our data out of the table and into R. Remember, all of this connection and data loading code was just from me working in R on my desktop. The PL/R extension will actually “translate” all of these calls into direct db calls.
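A minimal sketch of that code looks like the following; the connection details and the table name `final.training_data` are placeholders for whatever your actual connection and training table are:

```r
# RPostgreSQL handles the driver, connection, and queries
library(RPostgreSQL)

# Load the driver and open a connection (details here are placeholders)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "fires", host = "localhost", user = "postgres")

# Pull the training rows into an R dataframe (placeholder table name)
df <- dbGetQuery(con, "SELECT * FROM final.training_data")
```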
- We load the RPostgreSQL library, which will handle all the connections and queries.
- The next two lines load the driver and make the connection to our Postgres instance.
- The following line uses the connection (`con`) to run a SQL query against our database and put the results into an R dataframe (`df`).
Once we make this into PL/R, none of these connection calls actually use RPostgreSQL; they use the PL/R versions, which are no-ops, and the `dbGetQuery` call executes against the current database.
Executing R Code
Other than the data query, our function only executes one line of R code. We use the standard way to do logistic regression in R.
```r
logmodel_solar <- glm(hasfire ~ cs_rh_min + cs_air_max_temp + cs_precip + cs_solar, data=df, family = binomial("logit"))
```
- Use the `glm` function to carry out a generalized linear model.
- `hasfire` is the dependent variable (coded as 1 for a fire that day and 0 for no fire).
- The dependent variable is modeled as a linear function of the 4 independent variables.
- Use the dataframe, `df`, as the source of all the data values.
- Use the link function of type binomial logit. Using this link type changes the predicted values from any real number, as in ordinary linear regression, to values between 0 and 1, as in logistic regression.
- Save the output into an object named `logmodel_solar`.
Included in `logmodel_solar` is not only the significance of the overall model but also the actual parameter values for each independent variable. This will be important later in the blog.
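If you want to inspect the fitted object on your desktop, the usual R functions work on it:

```r
summary(logmodel_solar)  # overall fit, coefficient estimates, and their p-values
coef(logmodel_solar)     # just the fitted parameter values
```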
Cleaning Up and Returning
The next two lines release the connection to the database (a no-op in PL/R) and unload the driver. Finally, we return `logmodel_solar`. Now you may be wondering what happens to an R object when we pass it back to PostgreSQL. The PL/R doc clearly lays out the translation between R and PostgreSQL data types. In our case, since we are returning an R object, PL/R will convert it to a bytea type in PostgreSQL.
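Putting all of those pieces together, the complete function looks roughly like this (the connection details and training table name are the same placeholders as before):

```sql
CREATE OR REPLACE FUNCTION final.logistic_model()
RETURNS bytea AS $$
    # Inside PL/R these RPostgreSQL calls are no-ops against the current database
    library(RPostgreSQL)

    drv <- dbDriver("PostgreSQL")
    con <- dbConnect(drv, dbname = "fires")

    df <- dbGetQuery(con, "SELECT * FROM final.training_data")

    # Fit the logistic regression model
    logmodel_solar <- glm(hasfire ~ cs_rh_min + cs_air_max_temp + cs_precip + cs_solar,
                          data = df, family = binomial("logit"))

    dbDisconnect(con)
    dbUnloadDriver(drv)

    return(logmodel_solar)
$$ LANGUAGE plr;
```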
How to Use the Function
With our spiffy function in hand, how do we use it? Well, it’s as simple as a SQL call. We are actually going to combine running the function with saving its model to a table. Our table for holding the function output has the following simple structure (a DDL sketch follows the list):
`final.logistic_models`
- `id`: self-incrementing integer column to act as the primary key
- `name`: text field that can hold a name for the model
- `model`: bytea column to hold the model object
- `created_at`: timestamp column which defaults to `now()`
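A table along those lines can be created with DDL like the following (a sketch of the structure described above, not necessarily the exact definition used):

```sql
CREATE TABLE final.logistic_models (
    id         serial PRIMARY KEY,          -- self-incrementing primary key
    name       text,                        -- a label for the model
    model      bytea,                       -- the serialized R model object
    created_at timestamptz DEFAULT now()    -- when the model was stored
);
```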
So running our logistic regression model and creating a new table record is as simple as:
```sql
INSERT INTO final.logistic_models (name, model) VALUES ('first run', final.logistic_model());
```
With that in place, we could actually run a job every week to refit the logistic regression model based on newly input data. As a data scientist, I can pull the most recent model back out of the database and look at the model object in R on my desktop. I could do this without even having access to the original data.
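For example, pulling the most recent model back out is just a query against that table:

```sql
SELECT model
FROM final.logistic_models
ORDER BY created_at DESC
LIMIT 1;
```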
Wrap Up
In today’s post we demonstrated how a data scientist can use R with PostgreSQL both on their desktop and server-side. The great part was that my R code required zero changes to work in PL/R.
This scenario works well for a company that wants the data scientist working on the statistical algorithm and then, when they are finished, exposing it in a “safe” way to anyone using the database. By embedding it in PostgreSQL you get four immediate benefits:
- The data scientist's R code doesn’t have to be translated to a language that the DBA or application developer understands.
- Since the code runs on the server where the data is stored, your program avoids a whole round trip over the network to do the data analysis.
- You get a nice, clear separation of concerns; the data scientist gets to build and tune the statistical algorithm, and the consumer of the output has no way to alter it.
- As long as your functions are read only, with Postgres read replicas you get easy horizontal scaling of your algorithm. If you want to give it a try, setting up read replicas is trivial to do on a platform like Crunchy Bridge (which also has PL/R installed).
If you are wondering how an application developer can take advantage of this, tune into the final blog post in the series.
What do you have to say about this? Do you use R in your daily work? Have you ever tried PL/R to do your work? I would love to hear about your experience, share it with us on Twitter. Happy coding and data sciencing!