# !pip install pzflow matplotlib
Intro to PZFlow
This notebook demonstrates building a normalizing flow with PZFlow to learn the joint probability distribution of some 2-D data.
You do not need to have any previous knowledge of normalizing flows to get started with PZFlow.
However, if you are interested, here are some good sources:
- Eric Jang's tutorial: part 1, part 2
- Here is a list of papers, blogs, videos, and packages
- Two good intro papers using Coupling Layers: NICE, Real NVP
- The paper on Neural Spline Couplings
from pzflow import Flow
from pzflow.examples import get_twomoons_data
import jax.numpy as jnp
import matplotlib.pyplot as plt
plt.rcParams["figure.facecolor"] = "white"
First let's load some example data. It's the familiar two moons data set from scikit-learn, loaded into a Pandas DataFrame, which is the data format PZFlow uses on the user end.
data = get_twomoons_data()
data
|  | x | y |
| --- | --- | --- |
| 0 | -0.748695 | 0.777733 |
| 1 | 1.690101 | -0.207291 |
| 2 | 2.008558 | 0.285932 |
| 3 | 1.291547 | -0.441167 |
| 4 | 0.808686 | -0.481017 |
| ... | ... | ... |
| 99995 | 1.642738 | -0.221286 |
| 99996 | 0.981221 | 0.327815 |
| 99997 | 0.990856 | 0.182546 |
| 99998 | -0.343144 | 0.877573 |
| 99999 | 1.851718 | 0.008531 |
100000 rows × 2 columns
Let's plot it to see what it looks like.
plt.hist2d(data["x"], data["y"], bins=200)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Now let's build a normalizing flow. The details of constructing a normalizing flow are explored in the following tutorial notebooks, but for now, we can use the default flow built into PZFlow. This flow was designed to work well out-of-the-box for most data sets.
The only thing you are required to supply is the names of the columns in your data set.
As you can see in the Pandas DataFrame above, our columns are named "x" and "y".
flow = Flow(["x", "y"])
Now we can train our normalizing flow.
This is as simple as calling `flow.train(data)`.
There are several training parameters you can set, including the number of epochs, the batch size, the optimizer, and the random seed.
See the `Flow` documentation for more details.
For this example, let's use the defaults, but set `verbose=True` so that training losses are printed throughout the training process.
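For instance, training with non-default settings might look like the sketch below. The parameter names used here (`epochs`, `batch_size`, `seed`) are assumptions; check the `Flow.train` documentation for the exact names and defaults in your version of PZFlow.

# A sketch of training with non-default settings; the parameter names
# epochs, batch_size, and seed are assumed -- verify them against the
# Flow.train docs for your PZFlow version
losses = flow.train(data, epochs=200, batch_size=2048, seed=42, verbose=True)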
%%time
losses = flow.train(data, verbose=True)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Training 100 epochs
Loss:
(0) 2.3212
(1) 0.7040
(6) 0.3620
(11) 0.3410
(16) 0.3193
(21) 0.3124
(26) 0.3140
(31) 0.3226
(36) 0.3028
(41) 0.3105
(46) 0.3054
(51) 0.2984
(56) 0.3031
(61) 0.2963
(66) 0.3060
(71) 0.3016
(76) 0.3027
(81) 0.2962
(86) 0.3023
(91) 0.3067
(96) 0.2991
(100) 0.2978
CPU times: user 4min 29s, sys: 46.5 s, total: 5min 16s
Wall time: 1min 35s
Now let's plot the training losses to make sure everything looks like we expect it to...
plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.show()
Perfect!
Now we can draw samples from the flow, using the `sample` method.
Let's draw 10,000 samples and make another histogram to see if it matches the data.
samples = flow.sample(10_000, seed=0)
plt.hist2d(samples["x"], samples["y"], bins=200)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Looks great!
We can also use the flow to calculate posteriors, using the `posterior` method. We need to provide the name of the column we want to calculate a posterior for, as well as a grid on which to calculate the posterior.
grid = jnp.linspace(-2, 2, 100)
pdfs = flow.posterior(data, column="x", grid=grid)
The result is a big array of posteriors:
pdfs
Array([[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 7.9137310e-05,
        2.5412094e-04, 3.7483254e-04],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 2.2376558e-02,
        1.1772527e-02, 7.5370832e-03],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.0527854e+00,
        1.8525983e+00, 1.8694268e+00],
       ...,
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.6310947e+00,
        1.8701038e+00, 1.2859117e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 2.3276938e-05,
        7.8079924e-05, 1.1970673e-04],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.7900342e+00,
        8.5315311e-01, 1.5609255e-01]], dtype=float32)
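Each row is the posterior for the corresponding row of the input DataFrame, evaluated at each point of the grid, so we expect one row per input and one column per grid point:

# Expect shape (100000, 100): one posterior per input row, one column
# per grid point (assuming this row/column convention holds)
print(pdfs.shape)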
Let's plot the first posterior.
plt.plot(grid, pdfs[0])
plt.title(f"$y$ = {data['y'][0]:.2f}")
plt.xlabel("$x$")
plt.ylabel("$p(x|y)$")
plt.show()
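As a quick sanity check, each posterior should integrate to approximately one over the grid, assuming PZFlow normalizes posteriors over the provided grid. A minimal sketch using a Riemann sum:

# Approximate the integral of the first posterior over the grid; it
# should be close to 1 if the posterior is normalized over the grid
dx = grid[1] - grid[0]
print(jnp.sum(pdfs[0] * dx))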
Now let's store some information with the flow about the data it was trained on.
import pzflow
flow.info = f"""
This is an example flow, trained on 100,000 points from the scikit-learn
two moons data set.
The data set used to train this flow is available in the `examples` module:
>>> from pzflow.examples import get_twomoons_data
>>> data = get_twomoons_data()
This flow was created with pzflow version {pzflow.__version__}
"""
print(flow.info)
This is an example flow, trained on 100,000 points from the scikit-learn
two moons data set.
The data set used to train this flow is available in the `examples` module:
>>> from pzflow.examples import get_twomoons_data
>>> data = get_twomoons_data()
This flow was created with pzflow version 3.1.0
Now let's save the flow to a file that can be loaded later:
flow.save("example_flow.pzflow.pkl")
This file can be loaded on Flow instantiation:
flow = Flow(file="example_flow.pzflow.pkl")
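As a quick check that the save/load round trip worked, the loaded flow keeps its metadata and can be sampled right away (a sketch using only the methods shown above):

# The loaded flow retains the info string and trained parameters
print(flow.info)
check_samples = flow.sample(5, seed=0)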
Tracking the validation loss
Often in machine learning applications, we want to track validation loss as we train to make sure that we don't overfit. This is very easy to do with PZFlow. Below, we will repeat the training above, except this time we will provide a validation set to the training function.
First, let's split the data set into a training and validation set:
# 80% for the training set
train_set = data[: int(0.8 * len(data))]
# 20% for the validation set
val_set = data[int(0.8 * len(data)) :]
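Note that this simple slice assumes the rows are already in random order, which is true for the example data. If your rows are ordered, shuffle them first; a sketch using pandas:

# Shuffle the rows before splitting so that both sets cover the
# full distribution, then split 80/20 as above
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
train_set = shuffled[: int(0.8 * len(shuffled))]
val_set = shuffled[int(0.8 * len(shuffled)) :]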
%%time
flow = Flow(["x", "y"])
train_losses, val_losses = flow.train(train_set, val_set, verbose=True)
Training 100 epochs
Loss:
(0) 2.3199 2.3266
(1) 0.8355 0.8384
(6) 0.3615 0.3666
(11) 0.3367 0.3402
(16) 0.3243 0.3289
(21) 0.3074 0.3112
(26) 0.3118 0.3160
(31) 0.3527 0.3557
(36) 0.3215 0.3239
(41) 0.3218 0.3250
(46) 0.3075 0.3121
(51) 0.3153 0.3196
(56) 0.3564 0.3570
(61) 0.3089 0.3132
(66) 0.3024 0.3075
(71) 0.3291 0.3334
(76) 0.3014 0.3073
(81) 0.3025 0.3098
(86) 0.3099 0.3153
(91) 0.3163 0.3203
(96) 0.2967 0.3035
(100) 0.3042 0.3093
CPU times: user 3min 45s, sys: 43.2 s, total: 4min 28s
Wall time: 1min 22s
Now during training, the training losses are printed on the left, and the validation losses on the right.
Let's plot the losses again:
plt.plot(train_losses, label="Training")
plt.plot(val_losses, label="Validation")
plt.legend()
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
In this example, the validation loss closely tracks the training loss. This is expected, since both sets were drawn from the same distribution.
Note that by default, PZFlow will use the parameters corresponding to the epoch with the lowest validation loss.
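You can confirm which epoch that was from the recorded losses; a minimal sketch using the `val_losses` list returned above:

import numpy as np

# Index of the entry with the lowest validation loss (note that the
# first entry may be the pre-training loss, as in the printout above)
best_epoch = int(np.argmin(val_losses))
print(best_epoch, val_losses[best_epoch])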