Using PostgreSQL and SQL to Randomly Sample Data
In the last post of this series we introduced trying to model fire probability in Northern California based on weather data. We showed how to use SQL to do data shaping and preparation. We ended with a data set that was ready with all the fire occurrences and weather data in a single table almost prepped for logistic regression.
There is now one more step: sample the data. If you have worked with logistic regression before you know you should try to balance the number of occurrences (1) with absences (0). To do this we are going to sample out from the non_fire_weather equal to the count in fire_weather and then combine them into one table.
a picture showing a random sample of figures
Picture by OpenLearn Create
I thought for sure I was going to have to write pl/pgsql or pl/python to do this next task. As I tell people in my talks/workshops, “Start with Postgres until it doesn’t work for your use case. Then go back and read the Postgres doc.” Taking my own advice, I found a way to make this work with SQL. Let’s do it together below. I’m gonna spin up a small instance in Crunchy Bridge to do this work.
Sampling the non-fire days
First we sample as many non_fire_weather records as there are in count of records in the fire_weather table. There are Postgres built-in functions for sampling tables (look for keyword TABLESAMPLE in the FROM clause). There are two built in functions and the documentation does a good job of explaining them:
The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
You can pass a seed number as a parameter to the either method to guarantee repeatability of sampling between different calls to the query.
It’s important to note that:
This sampling precedes the application of any other filters such as WHERE clauses.
It is also important to note that neither method guarantees to return the exact number of rows requested.
To get the exact number sample, we need to load an extension called tsm_system_rows. With tsm_system_rows we get the exact number of rows we requested (unless there are fewer rows in the table than requested). But again the caveats are important:
Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
Tsm_system_rows does not accept a seed number so two calls of the SQL function are not guaranteed to return the same set of records.
For our use case, I decided that getting the exact number is important and I did not think clustering would be an issue. A good test is to run the sampling below with the bernoulli method and the tsm_system_rows method and look for an increase in autocorrelation in our predictor variable for the tsm_system_rows.
For now, let’s go ahead and add the extension:
CREATE EXTENSION tsm_system_rows;
Now we use a CTE and lateral join to get the data we want and put it into a table named “preanalysisdata”:
WITH count*fire AS ( SELECT count(*) AS thecount FROM fire*weather ) SELECT a.* INTO preanalysisdata FROM count_fire CROSS JOIN LATERAL (SELECT * FROM non_fire_weather TABLESAMPLE SYSTEM_ROWS(count_fire.thecount)) AS a;
- The CTE is just getting us the count of records in the fire table.
- We then use a lateral join in the second part of the query to pass the count number from the CTE into the subquery at the end.
- The subquery uses the count of fire rows to randomly sample the exact same number of non-fire weather days.
- Once that lateral join finishes, the query then passes all the rows to the first part of the select query and puts the results into a new table.
We now have our non-fire data subsample that was randomly sampled from all the non-fire weather data put into a table.
Adding the fire data
Now we use a simple SQL UNION to concatenate the preanalysis data (no fires) with our fire data set to give us the data that is ready for analysis.
SELECT * INTO analysisdata FROM preanalysisdata UNION SELECT * FROM fire_weather;
It's time for the final step of separating the data into training and validation sets. It is quite easy to want to focus on how well your statistical or data science model does with prediction of its training data. Unfortunately, by doing this you often end up tuning your model to be specific to the data used, ruining the model's ability to predict new data. In data science you often want to “hold back” some of your data to test how good your model is at predicting new data. With our dataset we are going to do 90% for training and 10% for validation. With PostgreSQL, this is as easy as two lines of code.
To separate our authoritative data for analysis we will make a new schema. This way we can give other data scientists read but NOT write permissions to this schema. By separating our final data we can be sure the data will not be accidentally altered by someone else:
CREATE SCHEMA final;
In writing the next lines of SQL I decided to go with simplicity over generality. I could have made these lines more general by using the CTE and expression types found above. Doing so would have allowed the query to work for any table size, but instead I manually calculated the 90% and 10% values for records and used them in the query.
First we get the training/analysis data:
SELECT * INTO final.analysis FROM analysisdata TABLESAMPLE SYSTEM_ROWS(2525);
Ninety percent of the original records equals 2525 records. Again we use the system_rows extension to randomly sample rows from the table. We also use “select into” to create the analysis table in the final schema.
Finally, we need to put the remaining rows into the validation table. Stated in plain English, I want all the rows from analysis data that are not in final.analysis. My first inclination was to write the query like this - please understand this is WRONG:
SELECT * INTO final.verification FROM analysisdata AS a, final.analysis AS fa WHERE a.id != fa.id;
This actually does a cross join (also called cartesian product), for all the ids that are not equal between the tables.
There are several different SQL forms we could use to get the right answer. I chose this one because it had the best performance and it is the most “relational” style answer:
SELECT * INTO final.verification FROM analysisdata EXCEPT SELECT * FROM final.analysis;
I also think reading this query makes it quite clear what we want for the outcome. You can read more about ‘except’ in the official documentation. You can check out this blog post where I give a discussion of how I got to this SQL.
I was really excited to find the ability to randomly sample a table right there in PostgreSQL. Again, I thought I was definitely going to have to write some pl/pgsql, pl/python, pl/r, or do it in the client code. Instead I can write some simple SQL and make generic sampling functions in one SQL call. If I wanted to I could have even passed a seed number into the sampling function to sample the exact same rows every time.
And with that we have finished breaking out our training and verification. Next step we are going to center and standardize the predictive variables we want to use in the logistic regression.
You could do all this simply by spinning up a PostgreSQL instance in Crunchy Bridge and use the data from the Github repo. Did you know about the table sampling function in SQL? Does it also bring you joy? Tell us more about how you have used table sampling functions. Leave a comment below or reach out to us on Twitter. Happy data sciencing!
October 27, 2020 •More by this author