Data Loading in Postgres for Newbies

So you’re new to Postgres and you want to test it out and see how it works. Or maybe you’re building a new app and you need to seed your database. How do you get data in your Postgres database? There’s a lot out there on the topic of data migration for Postgres and some of it can be complicated and overwhelming. I’m going to give you some quick and simple ways to get test data IN your database.

csv

If you need to load data from csv files, the psql \copy command in Postgres is a utility that can come to your rescue. It copies data from a file on disk to a table as bulk data load. \copy works from the command line and you’ll need to be connected to the database using psql.

Let’s take an example of storm events data from NOAA. Downloading one of the files locally and unzipping it you’ll have a csv ready to copy.

Before you can load the data, you’ll need to create a table in Postgres to store the data and give names to the columns. To get the column headers without reading the whole file run the linux head command:

head StormEvents_locations-ftp_v1.0_d2019_c20220425.csv

Resulting in:

YEARMONTH,EPISODE_ID,EVENT_ID,LOCATION_INDEX,RANGE,AZIMUTH,LOCATION,LATITUDE,LONGITUDE,LAT2,LON2
201904,135109,809787,1,.28,"NNE","GENOA",33.3834,-93.9174,3323004,9355044

Now you’ll create a table definition for these. You can peek at the data to make an informed guess about the data types. You can always use text if you’re not sure.

CREATE TABLE weather_events
(YEARMONTH text, EPISODE_ID int, EVENT_ID int,
LOCATION_INDEX int, RANGE float, AZIMUTH text,
LOCATION text, LATITUDE float, LONGITUDE float,
LAT2 int, LON2 int);

Now \copy the data in. Make sure you run with csv header since your file has headers in it:

psql \copy weather_events FROM ~/Downloads/StormEvents_locations-ftp_v1.0_d2019_c20220425.csv WITH CSV header

And done!

JSON

There’s a lot of JSON data out in the world nowadays. An essential tool for working with JSON is the jq tool andI’ll show an example of working with a JSON file from NOAA.

In this case, there is no distinct column header, so we will just create a single JSON column and populate separate rows for each JSON array element. We will use the \copy command from earlier, this time using the FROM PROGRAM option which lets us read the output of a command instead of a data file. Using this approach, we can preprocess the JSON file using jq and load each row in the JSON array into a row of the table.

First make your table:

CREATE TABLE solar_load (data jsonb);

Then load the data:

psql \copy solar_load FROM PROGRAM 'jq -c -r .[] < ~/Downloads/solar_probabilities.json';

If you want to break up each row array into separate columns after this you can do something like this to create a new table with seperated columns:

CREATE TABLE solar_load_with_columns(date date, c_class_1_day int,
c_class_2_day int, c_class_3_day int, m_class_1_day int,
m_class_2_day int, m_class_3_day int, x_class_1_day int,
x_class_2_day int, x_class_3_day int, "10mev_protons_1_day" int,
"10mev_protons_2_day" int, "10mev_protons_3_day" int,
polar_cap_absorption text);

Followed by an insert to move the data over:

INSERT INTO solar_load_with_columns
SELECT (json_populate_record(NULL::solar_load_with_columns, data::json)).*
FROM solar_load;

GUIs

There are a lot of user interfaces for Postgres and if you’re a newbie you willl likely want to use one or more of these for working with Postgres. PgAdmin4 is a very popular one and it has some key features. The downside to PgAdmin imports is that you have to create the table and columns before you can load in data.

I’ve done quite a bit of testing with DBeaver and that tool does a lot of the work for you when you import data via csv. To import data into Postgres, you create a new server resource, connect your resource, create a database, and use the csv import tool and the table creations and columns are created for you.

The DBeaver import has worked pretty well for me with simple data inserts. It can get a little tricky when you’re loading large files and working with its batch import features, so watch out for that. In my testing, the psql \copy function performs better than DBeaver for big data loads.

Dump and Restore

pg_dump is a common tool for extracting data from your database. It will output your data definitions and your data in a raw SQL form which makes it convenient to move data from different databases, doing upgrades, or a basic level of backup (though dedicated backup tools are a better option here). There are a lot of configuration options for this, so check the postgres docs before you get in too far.

To run a pg_dump on your database:

pg_dump
postgres://postgres:vcnQ1Ay0etSwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nfdy3jsblmo2a5nz2a.db.postgresbridge.com:5432/postgres >
backup.sql;

From a GUI, you can normally run a dump command as well, which will result in the same file being generated.

To restore a database from a previous database:

create database restore;

psql -d
postgres://postgres:vcnQ1Ay0etSwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nfdy3jsblmo2a5nz2a.db.postgresbridge.com:5432/restore
< backup.sql

If you need help understanding these connection strings used above, check out the diagram and information we published in a tutorial on Postgres Database and Schemas.

Foreign Data Wrappers

Getting data into one database from another one is really easy using foreign data wrappers. The postgres_fdw will let you connect to an external postgres resource and query it and use it from another database. I’ll show you a quick setup.

  • The FOREIGN instance, the one that has the data.
  • The DESTINATION instance, the database querying foreign data.

On the destination, create the extension to allow you to connect to other databases:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

On the foreign side, create a new user for the destination server to connect as:

CREATE USER fdw_user WITH PASSWORD 'pizza1';
GRANT SELECT, INSERT, UPDATE,
DELETE ON TABLE solar_load_with_columns TO fdw_user;

On the destination side create the foreign server, which tells Postgres where to connect for the remote data:

CREATE SERVER foreigndb_fdw
FOREIGN DATA WRAPPER
postgres_fdw
OPTIONS (host 'p.vbjrfujv5beutaoelw725gvi3i.db.postgresbridge.com',
port '5432', dbname 'postgres');

On the destination side, create the user mapping. This is how Postgres knows which user on the foreign side to connect as. In this case, all users on the destination side will connect as the same user.

CREATE USER MAPPING for PUBLIC SERVER
foreigndb_fdw OPTIONS (user 'fdw_user', password 'pizza1');

On the destination side, import the schema and limit it to the table names you want. This makes it so Postgres has a local table definition that matches the remote table’s definition and can be queried on the destination server.

IMPORT FOREIGN SCHEMA "public"
LIMIT TO(solar_load_with_columns)
FROM SERVER foreigndb_fdw INTO public;

Now your destination can query the foreign table!

SELECT *
FROM solar_load_with_columns;

Generate Data with a Query

Sometimes it is useful to generate mock data in Postgres that is more specific for your data model. An example of some easy queries for making fake data:

To create a series of user ids, use something like:

CREATE TABLE mock_login AS
SELECT 'userid' || s AS username FROM generate_series(1,10000) AS s;

To create random time intervals along with the user ids do something like:

CREATE TABLE mock_login2 AS SELECT 'userid' || s AS username, now() -
random() * interval '10 days' AS last_login FROM generate_series(1,10000) AS s;

And to add in random md5 passwords, use something like:

CREATE TABLE mock_login3
AS SELECT 'userid' || s as username, md5(random()::text) AS password, now() -
random() * interval '10 days' AS last_login FROM generate_series(1,10000) AS s;

Ok, now we’ve got a seed database of usernames, passwords, and login dates.

SELECT * FROM mock_login3 LIMIT 5;

username | password | last_login
----------+----------------------------------+-------------------------------
userid1 | 774b0cc950b990755199479829f019da | 2022-10-31 02:11:53.26224+00
userid2 | 594ec00eb078f7d8a62150597a779b86 | 2022-10-24 05:17:23.53117+00
userid3 | fde356d4e1257ac9dc7afb50f25fed60 | 2022-10-28 16:55:13.853136+00
userid4 | 2b3166be39ad16699f6e78eeba370c0c | 2022-10-24 13:54:36.446941+00
userid5 | 16541644f364416fbbbb321f36a0e56a | 2022-10-27 11:24:40.139086+00
(5 rows)

If you’re going for data generation, there’s also some really fun tools out there for that, like Mockaroo and the Rails gem Faker.

pgbench

I see a lot of engineers do a quick data load using pgbench since it comes with the database as a separate binary and it's a quick way to get testing with Postgres, irrelevant of the content. pgbench is useful if you are testing the performance of the database itself without regard for what data is in it. By default pgbench will insert 100,000 rows into the pgbench_accounts table. You can also add in a scale factor (-s) to determine how large the sample data should be. So a scale factor of 100 would be 10,000,000 rows.

pgbench -i -s 100 -d
postgres://postgres:vcnQ1AgQwUzimGBlFJirZagzpg0hbXy3byrnhbHMZf0Yaug6@p.6z5h5nct5fdylmo2a5nz2a.db.postgresbridge.com:5432/postgres
SELECT COUNT(*) FROM pgbench_accounts;


  count
----------
 10000000
(1 row)

Load It!

Hopefully now you have some very basic data loading strategies ready to go forward and do some testing. If you’re looking for something to load Geospatial data, we covered that a while back in Loading Data into PostGIS: An Overview. I always encourage people to do this kind of work in the cloud so you can really see what Postgres can do when property is tuned and resourced. Try Crunchy Bridge out for a few minutes and you’ll see what I mean.

Elizabeth Christensen

Written by

Elizabeth Christensen

November 1, 2022 More by this author