We previously looked at the popular DBSCAN spatial clustering algorithm, which builds clusters based on spatial density.

This post explores the features of the PostGIS ST_ClusterKMeans function. K-means clustering is having a moment, as a popular way of grouping very high-dimensional LLM embeddings, but it is also useful in lower dimensions for spatial clustering.

ST_ClusterKMeans will cluster 2-dimensional and 3-dimensional data, and will also perform weighted clustering on points when weights are provided in the "measure" dimension of the points.

To try out K-Means clustering we need some points to cluster, in this case the 1:10M populated places from Natural Earth.

Download the GIS files and load them into your database, in this example using ogr2ogr.

```
ogr2ogr \
-f PostgreSQL \
-nln popplaces \
-lco GEOMETRY_NAME=geom \
PG:'dbname=postgres' \
ne_10m_populated_places_simple.shp
```

A simple clustering in 2D space looks like this, using 10 as the number of clusters:

```
CREATE TABLE popplaces_geographic AS
SELECT geom, pop_max, name,
ST_ClusterKMeans(geom, 10) OVER () AS cluster
FROM popplaces;
```

Note that pieces of Russia are clustered with Alaska, and Oceania is split up. This is because we are treating the longitude/latitude coordinates of the points as if they were on a plane, so Alaska is very far away from Siberia.

For data confined to a small area, effects like the split at the dateline do not matter, but for our global example they do. Fortunately there is a way to work around it.

We can convert the longitude/latitude coordinates of the original data to a geocentric coordinate system using ST_Transform. A "geocentric" system is one in which the origin is the center of the Earth, and positions are defined by their X, Y and Z distances from that center.

In a geocentric system, positions on either side of the dateline are still very close together in space, so it's great for clustering global data without worrying about the effects of the poles or date line. For this example we will use EPSG:4978 as our geocentric system.
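For intuition, here is a small Python sketch of the geodetic-to-geocentric conversion (standard WGS84 constants, height zero) -- approximately what ST_Transform to EPSG:4978 computes for these points:

```python
import math

A = 6378137.0            # WGS84 semi-major axis (meters)
E2 = 0.00669437999014    # WGS84 first eccentricity squared

def geodetic_to_geocentric(lon_deg, lat_deg, h=0.0):
    """Longitude/latitude in degrees (EPSG:4326) to geocentric XYZ meters."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    n = A / math.sqrt(1.0 - E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - E2) + h) * math.sin(lat)
    return (x, y, z)

# Points on opposite sides of the dateline: nearly 360 degrees apart in raw
# longitude, but only a few kilometers apart in geocentric space.
west_of_line = geodetic_to_geocentric(179.9, 66.0)
east_of_line = geodetic_to_geocentric(-179.9, 66.0)
print(round(math.dist(west_of_line, east_of_line) / 1000, 1), "km apart")
```

A k-means run on these XYZ coordinates therefore sees dateline neighbors as neighbors.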

Here are the coordinates of New York, converted to geocentric.

```
SELECT ST_AsText(ST_Transform(ST_PointZ(-74.0060, 40.7128, 0, 4326), 4978), 1);
```

```
POINT Z (1333998.5 -4654044.8 4138300.2)
```

And here is the cluster operation performed in geocentric space.

```
CREATE TABLE popplaces_geocentric AS
SELECT geom, pop_max, name,
ST_ClusterKMeans(
ST_Transform(
ST_Force3D(geom),
4978),
10) OVER () AS cluster
FROM popplaces;
```

The results look very similar to the planar clustering, but you can see the "whole world" effect in a few places, like how Australia and all the islands of Oceania are now in one cluster, and how the dividing point between the Siberia and Alaska clusters has moved west across the date line.

It's worth noting that this clustering has been performed in three dimensions (since geocentric coordinates require an X, Y and Z), even though we are displaying the results in two dimensions.

In addition to naïve k-means, ST_ClusterKMeans can carry out weighted k-means clustering, to push the cluster locations around using extra information in the "M" dimension (the fourth coordinate) of the input points.

Since we have a "populated places" data set, it makes sense to use population as a weight for this example. The weighted algorithm requires strictly positive weights, so we filter out the handful of records that are non-positive.

```
CREATE TABLE popplaces_geocentric_weighted AS
SELECT geom, pop_max, name,
ST_ClusterKMeans(
ST_Force4D(
ST_Transform(ST_Force3D(geom), 4978),
mvalue => pop_max
),
10) OVER () AS cluster
FROM popplaces
WHERE pop_max > 0;
```

Again, the differences are subtle, but note how India is now a single cluster, how the Brazil cluster is now biased towards the populous eastern coast, and how North America is now split into east and west.
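To see how the weighting works, here is a minimal Python sketch of weighted k-means (Lloyd's algorithm). The update step takes a weighted mean, so heavily weighted points pull cluster centers toward themselves. This is an illustration of the principle, not the PostGIS implementation.

```python
import random

def weighted_kmeans(points, weights, k, iters=50, seed=42):
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment: each point joins its nearest center (plain distance).
        for i, (x, y) in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # Update: each center moves to the *weighted* mean of its members,
        # so heavily weighted points drag centers toward themselves.
        for c in range(k):
            members = [i for i in range(len(points)) if assign[i] == c]
            if members:
                w = sum(weights[i] for i in members)
                centers[c] = (
                    sum(weights[i] * points[i][0] for i in members) / w,
                    sum(weights[i] * points[i][1] for i in members) / w,
                )
    return assign, centers

# Two tight groups; the weight-100 point drags its cluster's center east.
pts = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
assign, centers = weighted_kmeans(pts, [1, 1, 1, 1, 1, 100], 2)
```

With equal weights the centers sit at the plain means; with the heavy weight on (11, 0), that cluster's center lands almost on top of it.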

The ST_ClusterDBSCAN function in PostGIS is a quick and easy way to extract clusters from point data. DBSCAN works directly with density, so it is well suited to population and density-oriented spatial data. To demonstrate `ST_ClusterDBSCAN`, I'm going to work with the geographic names data, specifically the schools, and show how we can quickly create a U.S. population density map.

Create a table to hold the data. Note that the geometry column is generated automatically from the longitude/latitude (EPSG:4326) and transformed into a planar projection for the USA (EPSG:5070).

```
CREATE TABLE geonames (
geonameid integer,
name text,
asciiname text,
alternatenames text,
latitude float8,
longitude float8,
fclass char,
fcode text,
country text,
cc2 text,
admin1 text,
admin2 text,
admin3 text,
admin4 text,
population bigint,
elevation integer,
dem text,
timezone text,
modification date,
geom geometry(point, 5070)
GENERATED ALWAYS AS
(ST_Transform(ST_Point(longitude, latitude, 4326),5070)) STORED
);
```

Now load the table. Note the super fun use of `PROGRAM` to pull data directly from the web and feed a `COPY`.

```
COPY geonames
FROM PROGRAM '(curl http://download.geonames.org/export/dump/US.zip > /tmp/US.zip) && unzip -p /tmp/US.zip US.txt'
WITH (FORMAT CSV, DELIMITER E'\t', HEADER false);
```

(This trick only works using the `postgres` superuser, since it involves calling a program and writing to system disk. If you do not have superuser access, download and unzip the `US.TXT` file by hand and load it using `COPY` from the file.)

Finally, add a spatial index to the `geom` column.

```
CREATE INDEX geonames_geom_x
ON geonames
USING GIST (geom);
```

There are 434 distinct feature codes in the `geonames` table. We will restrict our analysis to just the 205,848 schools, with an `fcode` of `SCH`.

```
SELECT Count(DISTINCT fcode) FROM geonames;
SELECT Count(fcode) FROM geonames WHERE fcode = 'SCH';
```

Schools are an interesting feature to analyze because there is a strong correlation between the number of schools and the population. There are a lot of schools! But they are not uniformly distributed.

If we zoom into the midwest, the concentration of schools in populated places
pops out. **We can use PostGIS to turn this distribution difference into a data
set of populated places!**

DBSCAN stands for "density-based spatial clustering of applications with noise". The PostGIS ST_ClusterDBSCAN implementation is a window function that takes three parameters:

- The geometries to be analyzed for clusters.
- An 'eps' distance tolerance. Geometries must be within this distance to be added to a cluster.
- A 'minpoints' count. If a point is within the 'eps' distance of at least 'minpoints' cluster members, it is a "core member" of the cluster.

An input geometry is added to a cluster if it is either:

- A "core" geometry, that is within eps distance of at least minpoints input geometries (including itself); or
- A "border" geometry, that is within eps distance of a core geometry.
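Those two rules can be sketched in a few lines. Here is a toy Python implementation with brute-force neighbor search -- fine for small inputs, while the PostGIS implementation is index-accelerated:

```python
from collections import deque

def dbscan(points, eps, minpoints):
    """Label each 2D point with a cluster id, or None for noise."""
    n = len(points)
    eps2 = eps * eps

    def neighbors(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if (xi - points[j][0]) ** 2 + (yi - points[j][1]) ** 2 <= eps2]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < minpoints:
            continue  # not core; may still be claimed later as a border point
        labels[i] = cluster        # start a new cluster from this core point
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] is None:
                labels[j] = cluster
                jn = neighbors(j)
                if len(jn) >= minpoints:   # j is core too: keep expanding
                    queue.extend(jn)
        cluster += 1
    return labels
```

Two dense patches come out as two clusters, and an isolated point stays unlabeled noise.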

How does this play out in practice?

If we zoom further into Chicago, around the suburban/exurban transition, the schools are about 1000 meters apart, sometimes more, sometimes less, transitioning out to 2000 meters and more in the exurbs.

For our clusters, we will use:

- An `eps` distance of 2000 meters
- A `minpoints` of 5
- A partition on the state code (`admin1`) to cut down on the number of distinct cluster numbers

```
CREATE TABLE geonames_sch AS
SELECT ST_ClusterDBScan(geom, 2000, 5)
OVER (PARTITION BY admin1) AS cluster, *
FROM geonames
WHERE fcode = 'SCH';
```

The result looks like this, with each cluster given a distinct color, and un-clustered schools rendered transparent.

The smaller clusters look a little arbitrary, but if we zoom in, we can see that even small population centers have been surfaced with this analytical technique.

Here is Kankakee, Illinois, neatly identified as a populated place by its cluster of schools.

Now that we have clusters, getting a populated place point is as simple as using the ST_Centroid function.

```
CREATE TABLE geonames_popplaces AS
SELECT ST_Centroid(ST_Collect(geom))::geometry(Point, 5070) AS geom,
Count(*) AS school_count,
cluster, admin1
FROM geonames_sch
GROUP BY cluster, admin1
```

We have completed the analysis, converting the density difference in school locations into a set of derived populated place points.
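For intuition, the centroid of a cluster of points is just the mean of the member coordinates; a small Python sketch:

```python
from collections import defaultdict

# The centroid of a cluster of points is the mean of the member coordinates,
# which is what ST_Centroid(ST_Collect(geom)) computes for point collections.
def cluster_centroids(points, labels):
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        if label is None:
            continue          # un-clustered noise points are dropped
        a = acc[label]
        a[0] += x
        a[1] += y
        a[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in acc.items()}

centroids = cluster_centroids([(0, 0), (2, 0), (0, 2), (10, 10)],
                              [0, 0, 0, None])
```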

Now for the whole population cluster map! To recap, the recipe was:

- Create a clustered table using `ST_ClusterDBScan`
- Set an `eps` for distance tolerance
- Set a `minpoints` to reduce density
- Partition on a different field to cut down on the number of distinct cluster numbers
- Create a final table using the `ST_Centroid` of the clusters
PostgreSQL comes with just a few simple foundational functions that can be used to fulfill most needs for randomness.

Almost all your randomness needs will be met with the `random()` function.

The `random()` function returns a double precision float drawn from a continuous uniform distribution between 0.0 and 1.0.

What does that mean? It means that you could get any value between 0.0 and 1.0, with equal probability, for each call of `random()`.

Here are five uniform random numbers between 0.0 and 1.0.

```
SELECT random() FROM generate_series(1, 5)
```

```
0.3978842227698167
0.7438732417540841
0.3875091442400458
0.4108009373061563
0.5524543763568912
```

Yep, those look pretty random! But, maybe not so useful?

Most times when people are trying to generate random numbers, they are looking
for random **integers** in a range, not random floats between 0.0 and 1.0.

Say you wanted random integers between 1 and 10, inclusive. How do you get that, starting from `random()`?

Start by scaling an ordinary `random()` number up by a factor of 10. Now you have a continuous distribution between 0.0 and 10.0.

```
SELECT 10 * random() FROM generate_series(1, 5)
```

```
3.978842227698167
7.438732417540841
3.875091442400458
4.108009373061563
5.5245437635689125
```

Then, if you push every one of those numbers down to the nearest integer using `floor()`, you'll end up with a random integer between 0 and 9.

```
SELECT floor(10 * random()) FROM generate_series(1, 5)
```

```
3
7
3
4
5
```

If you wanted a random integer between 1 and 10, you just need to add 1 to the zero-based number.

```
SELECT floor(10 * random()) + 1 FROM generate_series(1, 5)
```

```
4
8
4
5
6
```
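The same scale-floor-shift arithmetic, sketched in Python (`random.random()` is likewise uniform on [0, 1); the helper name is illustrative):

```python
import random

# Scale, floor, shift: a uniform float in [0, 1) becomes a uniform integer
# in [low, high], the same arithmetic as floor(10 * random()) + 1 in SQL.
def random_int(low, high, rng=random):
    span = high - low + 1
    return int(span * rng.random()) + low

rolls = [random_int(1, 10) for _ in range(10000)]
```

With 10,000 draws, every value from 1 to 10 shows up.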

Sometimes the things you are trying to choose randomly aren't numbers. How do you get a random entry out of an array? Or a random row from a table?

We already saw how to get one-based integers from `random()`, and we can apply that technique to the problem of pulling an entry from an array.

```
WITH f AS (
SELECT ARRAY[
'apple',
'banana',
'cherry',
'pear',
'peach'] AS fruits
)
SELECT fruits[ceil(array_length(fruits,1) * random())] AS snack
FROM f;
```

```
snack
-------
peach
```

Getting a random row involves some tradeoffs and thinking. For a small table, the naive way to get a single random row is this.

```
SELECT *
FROM fruits
ORDER BY random()
LIMIT 1
```

As you can imagine, this gets quite expensive if the `fruits` table gets too large, since it sorts the whole table every time.

If you only need a single random row, one way to achieve that is to add a random column to your table and index it.

```
CREATE TABLE fruits (
id SERIAL PRIMARY KEY,
fruit TEXT NOT NULL,
random FLOAT8 DEFAULT random()
);
INSERT INTO fruits (fruit)
VALUES ('apple'),('banana'),('cherry'),('pear'),('peach');
CREATE INDEX fruits_random_x ON fruits (random);
```

Then when it's time to search, use the random function to generate a starting search location and find the next highest value.

```
SELECT *
FROM fruits
WHERE random > random()
ORDER BY random ASC
LIMIT 1;
```

```
id | fruit | random
----+--------+--------------------
8 | banana | 0.1997961574379754
```

Be careful using this trick for more than one row, though: since the values in the random column are fixed, the sequence of rows returned will be deterministic, even if the start row is random. Also note that if the generated `random()` value is higher than every value in the random column, the query will return no rows.

If you want to pull large portions of a table into a query (for random sampling, for example) look at the `TABLESAMPLE` clause of the `SELECT` command.

Suppose I wanted the entire contents of the fruits collection, but returned in two random groups? This is actually much like getting a single random value: order the whole set randomly, and then use that ordering to determine grouping.

```
WITH random_fruits AS (
SELECT id, fruit
FROM fruits
ORDER BY random()
)
SELECT row_number() over () % 2 AS group,
id, fruit
FROM random_fruits
ORDER BY 1;
```

```
group | id | fruit
-------+----+--------
0 | 11 | peach
0 | 8 | banana
1 | 10 | pear
1 | 7 | apple
1 | 9 | cherry
```

The '2' in the example above is the number of groups desired.
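The shuffle-and-deal idea, sketched in Python (the helper name is illustrative):

```python
import random

# Shuffle, then deal group numbers out round-robin: the same idea as
# ORDER BY random() followed by row_number() % 2.
def random_groups(items, n_groups, seed=None):
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [(i % n_groups, item) for i, item in enumerate(shuffled)]

groups = random_groups(['apple', 'banana', 'cherry', 'pear', 'peach'], 2)
```

The group sizes are as even as possible by construction; only the membership is random.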

`random_normal`

So far we have just been looking at ways to permute the uniform distribution offered by the `random()` function. But there are in fact infinitely many other probability distributions that random numbers can be drawn from.

Of that infinite collection, by far the most frequently used in practice is the "normal distribution" also known as the "Gaussian distribution" or "bell curve".

Rather than having a hard cut-off point, the normal distribution has a frequent center and then ever lower probability of values out to infinity in both directions.

The position of the center of the distribution is the "mean" and the rate of probability decay is controlled by the "standard deviation".

To generate normally distributed data in PostgreSQL, use the `random_normal(mean, stddev)` function that was introduced in version 16.

```
SELECT random_normal(0, 1)
FROM generate_series(1,10)
ORDER BY 1
```

```
-0.8147201382612904
-0.5751449000210354
-0.4643454485382744
-0.0630592935151314
0.26438942114339203
0.39298889191244274
0.4946046063256206
0.8560911955145666
1.3534309793797454
1.664493506727331
```

It's kind of hard to appreciate that the data have a central tendency without generating a lot more of them and counting how many fall within each bin.

```
SELECT random_normal()::integer,
Count(*)
FROM generate_series(1,1000)
GROUP BY 1
ORDER BY 1
```

The cast to `integer` rounds the values to the nearest integer, so you can see that the data mostly fall within two standard deviations of the mean.

```
random_normal | count
---------------+-------
-3 | 5
-2 | 65
-1 | 233
0 | 378
1 | 246
2 | 67
3 | 5
4 | 1
```

If you looked **very** closely at the examples in the first section you'll have noticed that they all started from the same, allegedly random, values.

If `random()` truly is random, how did I get the same starting values four times in a row?

The answer, shockingly, is that `random()` is actually "pseudo-random".

A pseudorandom sequence of numbers is one that appears to be statistically random, despite having been produced by a completely deterministic and repeatable process.

With a pseudo-random number generator and a known starting point, I will always get the same sequence of numbers, at least on the same computer.

The reason most computer programs use pseudo-random number generators is that generating truly random numbers is actually quite an expensive operation (relatively speaking).

So programs instead generate one truly random number, and use that as a "seed" for a generator.

PostgreSQL uses the Blackman/Vigna `xoroshiro128**` 1.0 pseudo-random number generator.

By default, on start-up PostgreSQL sets up a seed value by calling an external random number generator, using an appropriate method for the platform:

- using OpenSSL `RAND_bytes()` if available, or
- using Windows `CryptGenRandom()` on that platform, or
- using the operating system `/dev/urandom` if necessary.

So if you are interested in a random number, just calling `random()` will get you one every time.

But if you want to put your finger on the scales, you can use the `setseed()` function to cause your `random()` and `random_normal()` functions to generate a deterministic series of random numbers, starting from a seed value you specify.

Random data is important for validating processing chains, analyses and reports. The best way to test a process is to feed it inputs!

Random points are pretty easy -- define an area of interest and then use the PostgreSQL `random()` function to create the X and Y values in that area.

```
CREATE TABLE random_points AS
WITH bounds AS (
SELECT 0 AS origin_x,
0 AS origin_y,
80 AS width,
80 AS height
)
SELECT ST_Point(width * (random() - 0.5) + origin_x,
height * (random() - 0.5) + origin_y,
4326)::Geometry(Point, 4326) AS geom,
id
FROM bounds,
generate_series(0, 100) AS id
```

Filling a target shape with random points is a common use case, and there's a special function just for that, `ST_GeneratePoints()`. Here we generate points inside a circle created with `ST_Buffer()`.

```
CREATE TABLE random_points AS
SELECT ST_GeneratePoints(
ST_Buffer(
ST_Point(0, 0, 4326),
50),
100) AS geom
```

If you have PostgreSQL 16, you can use the brand new `random_normal()` function to generate coordinates with a central tendency.

```
CREATE TABLE random_normal_points AS
WITH bounds AS (
SELECT 0 AS origin_x,
0 AS origin_y,
80 AS width,
80 AS height
)
SELECT ST_Point(random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
4326)::Geometry(Point, 4326) AS geom,
id
FROM bounds,
generate_series(0, 100) AS id
```

`random_normal()` definition.

```
CREATE OR REPLACE FUNCTION random_normal(
mean double precision DEFAULT 0.0,
stddev double precision DEFAULT 1.0)
RETURNS double precision AS
$$
DECLARE
u1 double precision;
u2 double precision;
z0 double precision;
z1 double precision;
BEGIN
u1 := random();
u2 := random();
z0 := sqrt(-2.0 * ln(u1)) * cos(2.0 * pi() * u2);
z1 := sqrt(-2.0 * ln(u1)) * sin(2.0 * pi() * u2);
RETURN mean + (stddev * z0);
END;
$$ LANGUAGE plpgsql;
```
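The same Box-Muller transform, sketched in Python. Note the guard that keeps `u1` away from zero, since `ln(0)` is undefined:

```python
import math
import random

# Box-Muller: two independent uniform deviates become two independent
# standard normal deviates -- the same transform as the function above.
def box_muller(rng):
    u1 = 1.0 - rng.random()   # map [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return (r * math.cos(2.0 * math.pi * u2),
            r * math.sin(2.0 * math.pi * u2))

rng = random.Random(7)
samples = [z for _ in range(50000) for z in box_muller(rng)]
mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
```

With 100,000 deviates the sample mean lands near 0 and the sample variance near 1, as expected for a standard normal.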

Linestrings are a little harder, because they involve more points, and aesthetically we like to avoid self-crossings of lines.

Two-point linestrings are pretty easy to generate with `ST_MakeLine()` -- just generate twice as many random points, and use them as the start and end points of the linestrings.

```
CREATE TABLE random_2point_lines AS
WITH bounds AS (
SELECT 0 AS origin_x, 80 AS width,
0 AS origin_y, 80 AS height
)
SELECT ST_MakeLine(
ST_Point(random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
4326),
ST_Point(random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
4326))::Geometry(LineString, 4326) AS geom,
id
FROM bounds,
generate_series(0, 100) AS id
```

Multi-point random linestrings are harder, at least while avoiding self-intersections, and there are a lot of potential approaches. While a recursive CTE could probably do it, an imperative approach using PL/PgSQL is more readable.

The `generate_random_linestring()` function starts with an empty linestring, and then adds new segments one at a time, changing the direction of the line with each new segment.

`generate_random_linestring()` definition.
```
CREATE OR REPLACE FUNCTION generate_random_linestring(
start_point geometry(Point))
RETURNS geometry(LineString, 4326) AS
$$
DECLARE
num_segments integer := 10; -- Number of segments in the linestring
deviation_max float := radians(45); -- Maximum deviation
random_point geometry(Point);
deviation float;
direction float := 2 * pi() * random();
segment_length float := 5; -- Length of each segment (adjust as needed)
i integer;
result geometry(LineString) := 'SRID=4326;LINESTRING EMPTY';
BEGIN
result := ST_AddPoint(result, start_point);
FOR i IN 1..num_segments LOOP
-- Generate a random angle within the specified deviation
deviation := 2 * deviation_max * random() - deviation_max;
direction := direction + deviation;
-- Calculate the coordinates of the next point
random_point := ST_Point(
ST_X(start_point) + cos(direction) * segment_length,
ST_Y(start_point) + sin(direction) * segment_length,
ST_SRID(start_point)
);
-- Add the point to the linestring
result := ST_AddPoint(result, random_point);
-- Update the start point for the next segment
start_point := random_point;
END LOOP;
RETURN result;
END;
$$
LANGUAGE plpgsql;
```

We can now use the `generate_random_linestring()` function to turn random start points (created in the usual way) into fully random squiggly lines!

```
CREATE TABLE random_lines AS
WITH bounds AS (
SELECT 0 AS origin_x, 80 AS width,
0 AS origin_y, 80 AS height
)
SELECT id,
generate_random_linestring(
ST_Point(random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
4326))::Geometry(LineString, 4326) AS geom
FROM bounds,
generate_series(1, 100) AS id;
```

At the simplest level, a set of random boxes is a set of random polygons, but that's pretty boring, and easy to generate using `ST_MakeEnvelope()`.

```
CREATE TABLE random_boxes AS
WITH bounds AS (
SELECT 0 AS origin_x, 80 AS width,
0 AS origin_y, 80 AS height
)
SELECT ST_MakeEnvelope(
random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
random_normal(origin_x, width/4),
random_normal(origin_y, height/4),
4326)::Geometry(Polygon, 4326) AS geom,
id
FROM bounds,
generate_series(0, 20) AS id
```

But more interesting polygons have curvy and concave shapes. How can we generate those?

One way is to extract a polygon from a set of random points using `ST_ConcaveHull()`, and then apply an "erode and dilate" effect to make the curves more pleasantly round.

We start with a random center point for each polygon, and create a circle with `ST_Buffer()`.

Then use `ST_GeneratePoints()` to fill the circle with some random points -- not too many, so we get a nice jagged result.

Then use `ST_ConcaveHull()` to trace a "boundary" around those points.

Then apply a negative buffer, to erode the shape.

And finally a positive buffer to dilate it back out again.

Generating multiple hulls involves stringing together all the above operations with CTEs or subqueries.

```
CREATE TABLE random_hulls AS
WITH bounds AS (
SELECT 0 AS origin_x,
0 AS origin_y,
80 AS width,
80 AS height
),
polypts AS (
SELECT ST_Point(random_normal(origin_x, width/2),
random_normal(origin_y, width/2),
4326)::Geometry(Point, 4326) AS geom,
polyid
FROM bounds,
generate_series(1,10) AS polyid
),
pts AS (
SELECT ST_GeneratePoints(ST_Buffer(geom, width/5), 20) AS geom,
polyid
FROM bounds,
polypts
)
SELECT ST_Multi(ST_Buffer(
ST_Buffer(
ST_ConcaveHull(geom, 0.3),
-2.0),
3.0))::Geometry(MultiPolygon, 4326) AS geom,
polyid
FROM pts;
```

Another approach is to again start with random points, but use the Voronoi diagram as the basis of the polygon.

Start with a center point and buffer circle.

Generate random points in the circle.

Use the `ST_VoronoiPolygons()` function to generate polygons that subdivide the space, using the random points as seeds.

Filter just the polygons that are fully contained in the originating circle.

And then use `ST_Union()` to merge those polygons into a single output shape.

Generating multiple hulls again involves stringing together the above operations with CTEs or subqueries.

```
CREATE TABLE random_delaunay_hulls AS
WITH bounds AS (
SELECT 0 AS origin_x,
0 AS origin_y,
80 AS width,
80 AS height
),
polypts AS (
SELECT ST_Point(random_normal(origin_x, width/2),
random_normal(origin_y, width/2),
4326)::Geometry(Point, 4326) AS geom,
polyid
FROM bounds,
generate_series(1,20) AS polyid
),
voronois AS (
SELECT ST_VoronoiPolygons(
ST_GeneratePoints(ST_Buffer(geom, width/5), 10)
) AS geom,
ST_Buffer(geom, width/5) AS geom_clip,
polyid
FROM bounds,
polypts
),
cells AS (
SELECT (ST_Dump(geom)).geom, polyid, geom_clip
FROM voronois
)
SELECT ST_Union(geom)::Geometry(Polygon, 4326) AS geom, polyid
FROM cells
WHERE ST_Contains(geom_clip, geom)
GROUP BY polyid;
```

Truly this is a bad map projection, on a par with the previous five:

The last two are just applications of common map projections with very uncommon projection parameters that accentuate certain areas of the globe, a cartographic version of the classic "View of the World from 9th Avenue".

A colleague asked me if we could recreate ABS(Longitude) and I figured it was worth a try!

At a minimum, we want a countries layer and some independent place labels to provide context, which is available at the first stop for basic global data, Natural Earth.

We have been playing with `ogr2ogr` and weird remote access tricks lately, and we can use `ogr2ogr` to load the data in one step.

```
# Load the countries and places directly from the remote
# zip file into the working PostgreSQL database
ogr2ogr \
-f PostgreSQL \
-nlt PROMOTE_TO_MULTI \
-lco OVERWRITE=yes \
-lco GEOMETRY_NAME=geom \
postgresql://postgres@localhost/xkcd \
/vsizip//vsicurl/https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip
ogr2ogr \
-f PostgreSQL \
-lco OVERWRITE=yes \
-lco GEOMETRY_NAME=geom \
postgresql://postgres@localhost/xkcd \
/vsizip//vsicurl/https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_populated_places.zip
```

Now we have the data in the database, ready to go!

The process we are going to apply will be transforming the shapes of one polygon
at a time, and the Natural Earth data models the countries with one
**MultiPolygon** per country.

Canada, for example, is one country, but 30 polygons.

We want a table with just one row for each polygon, so we "dump" all the multi-polygons using `ST_Dump()`.

```
CREATE SEQUENCE country_id;
CREATE TABLE countries AS
SELECT nextval('country_id') AS id,
name,
(ST_Dump(geom)).geom::geometry(Polygon, 4326) AS geom
FROM ne_110m_admin_0_countries;
```

Next, because we are going to transform western shapes (negative longitude) differently from eastern shapes, we have to solve a problem: what do we do with shapes that straddle the prime meridian?

The answer: `ST_Split()`!

First we create a prime meridian geometry to use as a "splitting blade".

```
CREATE TABLE lon_0 AS
SELECT ST_SetSrid(
ST_MakeLine(
ST_MakePoint(0,90),
ST_MakePoint(0,-90)),
4326)::geometry(LineString, 4326) AS geom;
```

Then we apply that blade to all the shapes that fall under it.

```
CREATE TABLE split_at_0 AS
SELECT id, name, ST_CollectionHomogenize(
ST_Split(c.geom, lon_0.geom))::geometry(MultiPolygon, 4326) AS geom
FROM countries c
JOIN lon_0
ON ST_Intersects(c.geom, lon_0.geom);
```

Surprisingly few countries end up chopped by the meridian.

The output of the split is, for each input polygon, a multi-polygon of the components. But we want to operate on the shapes one polygon at a time, so again, we must dump the multi-polygon into its components.

A slightly longer query dumps the split shapes, and stores them in a table with the rest of the un-split polygons, labeling each shape depending on whether it is "west" or "east" of the prime meridian.

```
CREATE TABLE countries_split AS
WITH split AS (
SELECT id, name, (ST_Dump(geom)).geom::geometry(Polygon, 4326) AS geom
FROM split_at_0
)
SELECT c.id, c.name, c.geom,
CASE WHEN ST_X(ST_StartPoint(c.geom)) >= 0 THEN 'east' ELSE 'west' END AS side
FROM countries c
LEFT JOIN split s
USING (id)
WHERE s.id IS NULL
UNION ALL
SELECT s.id, s.name, s.geom,
CASE WHEN ST_X(ST_StartPoint(s.geom)) >= 0 THEN 'east' ELSE 'west' END AS side
FROM split s;
```

We have divided the west from the east, and are ready for the final step.

Now we are ready to apply a transformation to all the "west" countries, to turn their negative longitudes into positive ones.

To do this, we will use the powerful `ST_Affine()` function.

The two-dimensional form of the function looks like this:

```
ST_Affine(geom, a, b, d, e, xoff, yoff)
```

Where the parameters correspond to an affine transformation matrix. In equation form: `x' = a*x + b*y + xoff` and `y' = d*x + e*y + yoff`.

From the equations it is pretty clear that we want to negate the input **x** (a = -1) and leave everything else alone (b = 0, d = 0, e = 1).

In order to get a pretty map, we'd like the output data to be centered on the prime meridian again, so we also shift everything 90 degrees west (xoff = -90).

And in SQL like this:

```
CREATE TABLE countries_affine AS
SELECT id, name,
CASE WHEN SIDE = 'west'
THEN ST_Affine(geom, -1, 0, 0, 1, -90, 0)
ELSE ST_Affine(geom, 1, 0, 0, 1, -90, 0)
END AS geom
FROM countries_split;
CREATE TABLE places_affine AS
SELECT ogc_fid AS id, name,
CASE WHEN ST_X(geom) < 0
THEN ST_Affine(geom, -1, 0, 0, 1, -90, 0)
ELSE ST_Affine(geom, 1, 0, 0, 1, -90, 0)
END AS geom
FROM ne_110m_populated_places
ORDER BY pop_max DESC;
```
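A Python sketch of the same affine arithmetic (the Paris coordinates are illustrative):

```python
# The 2D affine transform used above:
#   x' = a*x + b*y + xoff
#   y' = d*x + e*y + yoff
def affine_2d(pt, a, b, d, e, xoff, yoff):
    x, y = pt
    return (a * x + b * y + xoff, d * x + e * y + yoff)

# Western hemisphere: negate the longitude (a = -1), then shift 90 west.
new_york = affine_2d((-74.006, 40.7128), -1, 0, 0, 1, -90, 0)
# Eastern hemisphere: keep the longitude (a = 1), shift 90 west.
paris = affine_2d((2.3522, 48.8566), 1, 0, 0, 1, -90, 0)
```

Both branches leave latitude untouched and land the reflected-and-shifted longitudes back in the [-180, 180] range.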

And the final result on the map looks like the XKCD map, without the pretty hand-labeling and mountains:

The bad map projections aren't the only cartographic cartoons XKCD explored. If you liked this one, take a look at:

- Map Projections, all real!
- World According to Americans
- Upside Down
- Map Age Guide, fascinating history!