# !pip install pzflow matplotlib
Intro to PZFlow
This notebook demonstrates building a normalizing flow with PZFlow to learn the joint probability distribution of some 2-D data.
You do not need to have any previous knowledge of normalizing flows to get started with PZFlow.
However, if you are interested, here are some good sources:
- Eric Jang's tutorial: part 1, part 2
- Here is a list of papers, blogs, videos, and packages
- Two good intro papers using Coupling Layers: NICE, Real NVP
- The paper on Neural Spline Couplings
from pzflow import Flow
from pzflow.examples import get_twomoons_data
import jax.numpy as jnp
import matplotlib.pyplot as plt
plt.rcParams["figure.facecolor"] = "white"
First let's load some example data. It's the familiar two moons data set from scikit-learn, loaded into a Pandas DataFrame, which is the data format PZFlow uses on the user end.
data = get_twomoons_data()
data
|  | x | y |
| --- | --- | --- |
| 0 | -0.748695 | 0.777733 |
| 1 | 1.690101 | -0.207291 |
| 2 | 2.008558 | 0.285932 |
| 3 | 1.291547 | -0.441167 |
| 4 | 0.808686 | -0.481017 |
| ... | ... | ... |
| 99995 | 1.642738 | -0.221286 |
| 99996 | 0.981221 | 0.327815 |
| 99997 | 0.990856 | 0.182546 |
| 99998 | -0.343144 | 0.877573 |
| 99999 | 1.851718 | 0.008531 |
100000 rows × 2 columns
Let's plot it to see what it looks like.
plt.hist2d(data["x"], data["y"], bins=200)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Now let's build a normalizing flow. The details of constructing a normalizing flow are explored in the following tutorial notebooks, but for now, we can use the default flow built into PZFlow. This flow was designed to work well out-of-the-box for most data sets.
The only thing you are required to supply is the names of the columns in your data set.
As you can see in the Pandas DataFrame above, our columns are named "x" and "y".
flow = Flow(["x", "y"])
Now we can train our normalizing flow.
This is as simple as calling `flow.train(data)`.
There are several training parameters you can set, including the number of epochs, the batch size, the optimizer, and the random seed.
See the `Flow` documentation for more details.
For this example, let's use the defaults, but set `verbose=True` so that training losses are printed throughout the training process.
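For instance, training with non-default settings might look like the sketch below. The parameter names used here (`epochs`, `batch_size`, `seed`) are assumptions; check the `Flow.train` documentation for the exact names and defaults in your version of PZFlow.

# A sketch of training with non-default settings; the parameter names
# epochs, batch_size, and seed are assumed -- verify them against the
# Flow.train docs for your PZFlow version
losses = flow.train(data, epochs=200, batch_size=2048, seed=42, verbose=True)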
%%time
losses = flow.train(data, verbose=True)
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Training 100 epochs
Loss:
(0) 2.3212
(1) 0.7040
(6) 0.3620
(11) 0.3410
(16) 0.3193
(21) 0.3124
(26) 0.3140
(31) 0.3226
(36) 0.3028
(41) 0.3105
(46) 0.3054
(51) 0.2984
(56) 0.3031
(61) 0.2963
(66) 0.3060
(71) 0.3016
(76) 0.3027
(81) 0.2962
(86) 0.3023
(91) 0.3067
(96) 0.2991
(100) 0.2978
CPU times: user 4min 29s, sys: 46.5 s, total: 5min 16s
Wall time: 1min 35s
Now let's plot the training losses to make sure everything looks like we expect it to...
plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.show()
Perfect!
Now we can draw samples from the flow, using the `sample` method.
Let's draw 10,000 samples and make another histogram to see if it matches the data.
samples = flow.sample(10_000, seed=0)
plt.hist2d(samples["x"], samples["y"], bins=200)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Looks great!
We can also use the flow to calculate posteriors, using the `posterior` method. We need to provide the name of the column we want to calculate a posterior for, as well as a grid on which to calculate the posterior.
grid = jnp.linspace(-2, 2, 100)
pdfs = flow.posterior(data, column="x", grid=grid)
The result is a big array of posteriors:
pdfs
Array([[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 7.9137310e-05,
        2.5412094e-04, 3.7483254e-04],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 2.2376558e-02,
        1.1772527e-02, 7.5370832e-03],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.0527854e+00,
        1.8525983e+00, 1.8694268e+00],
       ...,
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.6310947e+00,
        1.8701038e+00, 1.2859117e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 2.3276938e-05,
        7.8079924e-05, 1.1970673e-04],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.7900342e+00,
        8.5315311e-01, 1.5609255e-01]], dtype=float32)
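Each row is the posterior for the corresponding row of the input DataFrame, evaluated at each point of the grid, so we expect one row per input and one column per grid point:

# Expect shape (100000, 100): one posterior per input row, one column
# per grid point (assuming this row/column convention holds)
print(pdfs.shape)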
Let's plot the first posterior.
plt.plot(grid, pdfs[0])
plt.title(f"$y$ = {data['y'][0]:.2f}")
plt.xlabel("$x$")
plt.ylabel("$p(x|y)$")
plt.show()
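As a quick sanity check, each posterior should integrate to approximately one over the grid, assuming PZFlow normalizes posteriors over the provided grid. A minimal sketch using a Riemann sum:

# Approximate the integral of the first posterior over the grid; it
# should be close to 1 if the posterior is normalized over the grid
dx = grid[1] - grid[0]
print(jnp.sum(pdfs[0] * dx))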
Now let's store some information with the flow about the data it was trained on.
import pzflow
flow.info = f"""
This is an example flow, trained on 100,000 points from the scikit-learn
two moons data set.
The data set used to train this flow is available in the `examples` module:
>>> from pzflow.examples import get_twomoons_data
>>> data = get_twomoons_data()
This flow was created with pzflow version {pzflow.__version__}
"""
print(flow.info)
This is an example flow, trained on 100,000 points from the scikit-learn
two moons data set.
The data set used to train this flow is available in the `examples` module:
>>> from pzflow.examples import get_twomoons_data
>>> data = get_twomoons_data()
This flow was created with pzflow version 3.1.0
Now let's save the flow to a file that can be loaded later:
flow.save("example_flow.pzflow.pkl")
This file can be loaded on Flow instantiation:
flow = Flow(file="example_flow.pzflow.pkl")
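As a quick check that the save/load round trip worked, the loaded flow keeps its metadata and can be sampled right away (a sketch using only the methods shown above):

# The loaded flow retains the info string and trained parameters
print(flow.info)
check_samples = flow.sample(5, seed=0)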
Tracking the validation loss
Often in machine learning applications, we want to track validation loss as we train to make sure that we don't overfit. This is very easy to do with PZFlow. Below, we will repeat the training above, except this time we will provide a validation set to the training function.
First, let's split the data set into a training and validation set:
# 80% for the training set
train_set = data[: int(0.8 * len(data))]
# 20% for the validation set
val_set = data[int(0.8 * len(data)) :]
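Note that this simple slice assumes the rows are already in random order, which is true for the example data. If your rows are ordered, shuffle them first; a sketch using pandas:

# Shuffle the rows before splitting so that both sets cover the
# full distribution, then split 80/20 as above
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
train_set = shuffled[: int(0.8 * len(shuffled))]
val_set = shuffled[int(0.8 * len(shuffled)) :]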
%%time
flow = Flow(["x", "y"])
train_losses, val_losses = flow.train(train_set, val_set, verbose=True)
Training 100 epochs
Loss:
(0) 2.3199 2.3266
(1) 0.8355 0.8384
(6) 0.3615 0.3666
(11) 0.3367 0.3402
(16) 0.3243 0.3289
(21) 0.3074 0.3112
(26) 0.3118 0.3160
(31) 0.3527 0.3557
(36) 0.3215 0.3239
(41) 0.3218 0.3250
(46) 0.3075 0.3121
(51) 0.3153 0.3196
(56) 0.3564 0.3570
(61) 0.3089 0.3132
(66) 0.3024 0.3075
(71) 0.3291 0.3334
(76) 0.3014 0.3073
(81) 0.3025 0.3098
(86) 0.3099 0.3153
(91) 0.3163 0.3203
(96) 0.2967 0.3035
(100) 0.3042 0.3093
CPU times: user 3min 45s, sys: 43.2 s, total: 4min 28s
Wall time: 1min 22s
Now during training, the training losses are printed on the left, and the validation losses on the right.
Let's plot the losses again:
plt.plot(train_losses, label="Training")
plt.plot(val_losses, label="Validation")
plt.legend()
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
In this example, the validation loss closely tracks the training loss. This is expected, since both sets were drawn from the same distribution.
Note that by default, PZFlow will use the parameters corresponding to the epoch with the lowest validation loss.
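You can confirm which epoch that was from the recorded losses; a minimal sketch using the `val_losses` list returned above:

import numpy as np

# Index of the entry with the lowest validation loss (note that the
# first entry may be the pre-training loss, as in the printout above)
best_epoch = int(np.argmin(val_losses))
print(best_epoch, val_losses[best_epoch])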