Tutorial for Python Users

In this introductory tutorial, we go through the different steps of a CausalEGM workflow.

After installation, CausalEGM can be used through its Python API, its R API, or a single command-line call.

First, install CausalEGM; please refer to the install page.

Use CausalEGM Python API

[1]:
import CausalEGM as cegm
print("Currently use version v%s of CausalEGM."%cegm.__version__)
Currently use version v0.4.0 of CausalEGM.

Configuring a CausalEGM model

Before creating a CausalEGM model, create a Python dict that holds its hyperparameters, including the dimensions of the latent features, the neural network architecture, and so on.

The detailed hyperparameters are described as follows.

Config Parameters

Description

output_dir

Output directory to save the results during the model training. Default: “.”

dataset

Dataset name for indicating the input data. Default: “Mydata”

z_dims

Latent dimensions of the encoder outputs (e(V)_0~3). Default: [3,6,3,6]

v_dim

Dimension of covariates.

lr

Learning rate. Default: 0.0002

g_units

Number of units for decoder/generator network G. Default: [64,64,64,64,64].

e_units

Number of units for encoder network E. Default: [64,64,64,64,64].

f_units

Number of units for F network. Default: [64,32,8].

h_units

Number of units for H network. Default: [64,32,8].

dz_units

Number of units for discriminator network in latent space. Default: [64,32,8].

dv_units

Number of units for discriminator network in covariate space. Default: [64,32,8].

alpha

Coefficient for reconstruction loss. Default: 1.

beta

Coefficient for roundtrip loss. Default: 1.

gamma

Coefficient for gradient penalty loss. Default: 10.

g_d_freq

Frequency for updating discriminators and generators. Default: 5.

save_res

Whether to save results during the model training. Default: True.

save_model

Whether to save the model weights. Default: False.

binary_treatment

Whether to use binary treatment setting. Default: True.

use_z_rec

Use the reconstruction for latent features. Default: True.

use_v_gan

Use the GAN distribution match for covariates. Default: True.

x_min

Left bound for dose-response interval in continuous treatment settings. Default: 0.

x_max

Right bound for dose-response interval in continuous treatment settings. Default: 3.

Tips

Config parameters are necessary for creating a CausalEGM model. Here are some tips for configuring parameters.

  1. z_dims has a noticeable impact on performance; please refer to src/configs for guidance.

  2. If save_res is True, results during training are saved to output_dir.

  3. use_v_gan is recommended to be True in the binary treatment setting and False in the continuous treatment setting.
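
As a rough illustration of the parameters and tips above, the sketch below assembles a config dict from the defaults documented in this tutorial. The helper make_config is hypothetical and not part of CausalEGM; it only copies the listed defaults and, following tip 3, lets use_v_gan follow binary_treatment when it is not set explicitly.

```python
import copy

# Defaults as documented in this tutorial; v_dim has no documented default.
DEFAULTS = {
    "output_dir": ".",
    "dataset": "Mydata",
    "z_dims": [3, 6, 3, 6],
    "lr": 0.0002,
    "g_units": [64, 64, 64, 64, 64],
    "e_units": [64, 64, 64, 64, 64],
    "f_units": [64, 32, 8],
    "h_units": [64, 32, 8],
    "dz_units": [64, 32, 8],
    "dv_units": [64, 32, 8],
    "alpha": 1,
    "beta": 1,
    "gamma": 10,
    "g_d_freq": 5,
    "save_res": True,
    "save_model": False,
    "binary_treatment": True,
    "use_z_rec": True,
    "use_v_gan": True,
    "x_min": 0,
    "x_max": 3,
}


def make_config(v_dim, **overrides):
    """Hypothetical helper: build a params dict for data with v_dim covariates."""
    if not isinstance(v_dim, int) or v_dim <= 0:
        raise ValueError("v_dim must be a positive integer")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config parameters: {sorted(unknown)}")
    params = copy.deepcopy(DEFAULTS)
    params["v_dim"] = v_dim
    params.update(overrides)
    # Tip 3: use_v_gan follows binary_treatment unless the caller sets it.
    if "use_v_gan" not in overrides:
        params["use_v_gan"] = params["binary_treatment"]
    # z_dims holds one latent dimension per encoder output e(V)_0~3.
    if len(params["z_dims"]) != 4:
        raise ValueError("z_dims needs four entries")
    return params


binary_params = make_config(v_dim=177)
continuous_params = make_config(v_dim=200, binary_treatment=False, x_min=0, x_max=3)
print(binary_params["use_v_gan"], continuous_params["use_v_gan"])
```

The v_dim values here are placeholders; in practice v_dim must equal the number of covariate columns in your data.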

Examples for loading config parameters

We provide hyperparameter templates for different datasets and settings in the CausalEGM/src/configs folder.

Users can use yaml to load the hyperparameters as a python dict object easily.

[2]:
import yaml
# Use a context manager so the config file is closed after loading.
with open('../../src/configs/Semi_acic.yaml', 'r') as f:
    params = yaml.safe_load(f)
print(params)
{'dataset': 'Semi_acic', 'output_dir': '.', 'v_dim': 177, 'z_dims': [3, 6, 3, 6], 'lr': 0.0002, 'alpha': 1, 'beta': 1, 'gamma': 10, 'g_d_freq': 5, 'g_units': [64, 64, 64, 64, 64], 'e_units': [64, 64, 64, 64, 64], 'f_units': [64, 32, 8], 'h_units': [64, 32, 8], 'dz_units': [64, 32, 8], 'dv_units': [64, 32, 8], 'save_res': True, 'save_model': False, 'binary_treatment': True, 'use_z_rec': True, 'use_v_gan': True}
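
A template usually needs a few fields changed before it fits a new dataset. The sketch below starts from the dict printed above (pasted as a literal so the example does not depend on the YAML file) and overrides dataset, output_dir, and v_dim. The dataset name, output directory, and covariate count are made-up placeholders.

```python
import copy

# Template values as printed by the cell above.
template = {
    'dataset': 'Semi_acic', 'output_dir': '.', 'v_dim': 177,
    'z_dims': [3, 6, 3, 6], 'lr': 0.0002,
    'alpha': 1, 'beta': 1, 'gamma': 10, 'g_d_freq': 5,
    'g_units': [64, 64, 64, 64, 64], 'e_units': [64, 64, 64, 64, 64],
    'f_units': [64, 32, 8], 'h_units': [64, 32, 8],
    'dz_units': [64, 32, 8], 'dv_units': [64, 32, 8],
    'save_res': True, 'save_model': False, 'binary_treatment': True,
    'use_z_rec': True, 'use_v_gan': True,
}

# Placeholder values for a hypothetical dataset with 25 covariates.
my_params = copy.deepcopy(template)
my_params.update({
    'dataset': 'Mydata',
    'output_dir': './results_mydata',
    'v_dim': 25,
})

# The deep copy keeps the template (including its nested lists) untouched.
print(my_params['v_dim'], template['v_dim'])
```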

Initializing a CausalEGM model

Once the hyperparameters (params) are prepared, creating a CausalEGM model takes a single call.

timestamp should be set to None if you want to train a model from scratch rather than load a pretrained model.

random_seed denotes the random seed used for reproducing the results.

[3]:
model = cegm.CausalEGM(params=params,random_seed=123)

Data preparation

Before training a CausalEGM model, we need to provide the data in a triplet, which contains treatment (x), potential outcome (y), and covariates (v).

Note that treatment (x) and potential outcome (y) should each be either a 1-dimensional array or an array with an additional axis of length one. Covariates (v) should be a two-dimensional array.
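
The shape requirements above can be checked before training. The following is a minimal sketch using NumPy on synthetic data; as_triplet is a hypothetical helper, not a CausalEGM function, and reshaping x and y to (n, 1) is simply one of the two accepted layouts.

```python
import numpy as np


def as_triplet(x, y, v):
    """Hypothetical helper: validate shapes and return [x, y, v].

    x and y are returned as (n, 1) arrays; v must be a 2-D (n, v_dim) array.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    if v.ndim != 2:
        raise ValueError(f"v must be 2-D, got shape {v.shape}")
    n = v.shape[0]
    for name, arr in (("x", x), ("y", y)):
        if arr.shape not in ((n,), (n, 1)):
            raise ValueError(
                f"{name} must have shape ({n},) or ({n}, 1), got {arr.shape}"
            )
    return [x.reshape(n, 1), y.reshape(n, 1), v]


# Synthetic data for illustration only: 100 samples, 5 covariates.
rng = np.random.default_rng(0)
v = rng.normal(size=(100, 5))
x = rng.integers(0, 2, size=100)        # binary treatment, 1-D
y = x + v[:, 0] + rng.normal(size=100)  # outcome, 1-D
data = as_triplet(x, y, v)
print([a.shape for a in data])
```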

Tips

There are three different ways to feed the training data to a CausalEGM model.

  1. Loading an existing dataset from a data sampler.

  2. Loading data from a python triplet list [x,y,v].

  3. Loading data from a csv, txt, or npz file, where an example is provided at [path_to_CausalEGM]/test/demo.csv.

[4]:
#get the data from the ACIC 2018 competition dataset with a specified ufid.
x,y,v = cegm.Semi_acic_sampler(path='data/ACIC_2018',ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb').load_all()
print(x.shape,y.shape,v.shape)
(50000, 1) (50000, 1) (50000, 177)

Run CausalEGM model training

Once the data is ready, CausalEGM can be trained with the following parameters.

Training parameters

Description

data

List object containing the triplet data [X,Y,V]. Default: None.

data_file

Str object denoting the path to the input file (csv, txt, npz). Default: None.

sep

Str object denoting the delimiter for the input file. Default: \t.

header

Int object denoting row number(s) to use as the column names. Default: 0.

normalize

Bool object denoting whether to apply standard normalization to covariates. Default: False.

batch_size

Int object denoting the batch size in training. Default: 32.

n_iter

Int object denoting the training iterations. Default: 30000.

batches_per_eval

Int object denoting the number of iterations per evaluation. Default: 500.

batches_per_save

Int object denoting the number of iterations per save. Default: 10000.

startoff

Int object denoting the number of initial iterations to skip without saving or evaluation. Default: 0.

verbose

Bool object denoting whether to show the progress bar. Default: True.

save_format

Str object denoting the format (csv, txt, npz) to save the results. Default: txt.
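
For reference, "standard normalization" of covariates usually means rescaling each column to zero mean and unit variance. The sketch below shows that operation with NumPy. It is a generic illustration: whether the normalize option computes exactly this (for example, how it treats constant columns) is an assumption, and standardize is a hypothetical helper.

```python
import numpy as np


def standardize(v, eps=1e-8):
    """Scale each covariate column to zero mean and unit variance."""
    v = np.asarray(v, dtype=float)
    mean = v.mean(axis=0)
    std = v.std(axis=0)
    # Constant columns have zero spread; divide by 1 to avoid NaNs.
    return (v - mean) / np.where(std < eps, 1.0, std)


# Toy covariates: the third column is constant.
v_demo = np.array([[1.0, 10.0, 5.0],
                   [2.0, 20.0, 5.0],
                   [3.0, 30.0, 5.0]])
v_std = standardize(v_demo)
print(v_std.mean(axis=0), v_std.std(axis=0))
```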

[5]:
model.train(data=[x,y,v],n_iter=100,save_format='npy',verbose=False)
The average treatment effect (ATE) is  -0.0064516705

We train CausalEGM for only 100 iterations for illustration; the recommended value of n_iter is 30000.

The results are saved to a location determined by the output_dir parameter. The file causal_pre_at_[iter_number].[format] contains the individual treatment effect (ITE) in binary treatment settings and the average dose-response values in continuous treatment settings.

iter_number denotes the training iteration, and format is determined by save_format, which can be csv, txt, or npz.
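
To collect the saved files after a run, one option is to search below output_dir for names that follow the causal_pre_at_[iter_number].[format] pattern. The sketch below does this with the standard library on a throwaway directory. The exact subdirectory layout under output_dir is not specified here, so the search is recursive; list_result_files, ate_from_ite, and the some_run folder are hypothetical. In the binary treatment setting, averaging the ITE values gives the ATE, shown here on made-up numbers.

```python
import re
import tempfile
from pathlib import Path
from statistics import fmean

# File name pattern described above; npy is included because the
# training cell in this tutorial passes save_format='npy'.
PATTERN = re.compile(r"causal_pre_at_(\d+)\.(csv|txt|npz|npy)$")


def list_result_files(output_dir):
    """Return (iteration, path) pairs under output_dir, sorted by iteration."""
    found = []
    for path in Path(output_dir).rglob("causal_pre_at_*"):
        match = PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


def ate_from_ite(ite_values):
    """The ATE is the mean of the individual treatment effects."""
    return fmean(ite_values)


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "some_run"
    run_dir.mkdir()
    for it in (10000, 500, 30000):
        (run_dir / f"causal_pre_at_{it}.txt").touch()
    iterations = [it for it, _ in list_result_files(tmp)]

print(iterations)
print(ate_from_ite([0.5, -0.25, 0.0, 0.75]))  # illustrative ITE values
```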

Use CausalEGM by a command-line interface (CLI)

When CausalEGM is installed with pip, setuptools adds the console script to PATH and makes it available for general use. This has the advantage that CausalEGM can also be called from non-Python scripts. The CLI takes a text file as input.

[6]:
!causalEGM -h
usage: causalEGM [-h] -output_dir OUTPUT_DIR -input INPUT [-dataset DATASET]
                 [--save-model | --no-save-model]
                 [--binary-treatment | --no-binary-treatment]
                 [-z_dims Z_DIMS [Z_DIMS ...]] [-lr LR] [-alpha ALPHA]
                 [-beta BETA] [-gamma GAMMA] [-g_d_freq G_D_FREQ]
                 [-g_units G_UNITS [G_UNITS ...]]
                 [-e_units E_UNITS [E_UNITS ...]]
                 [-f_units F_UNITS [F_UNITS ...]]
                 [-h_units H_UNITS [H_UNITS ...]]
                 [-dz_units DZ_UNITS [DZ_UNITS ...]]
                 [-dv_units DV_UNITS [DV_UNITS ...]]
                 [--use-z-rec | --no-use-z-rec] [--use-v-gan | --no-use-v-gan]
                 [-batch_size BATCH_SIZE] [-n_iter N_ITER]
                 [-startoff STARTOFF] [-batches_per_eval BATCHES_PER_EVAL]
                 [-save_format SAVE_FORMAT] [--save_res | --no-save_res]
                 [-seed SEED]

CausalEGM: A general causal inference framework by encoding generative
modeling - v0.4.0

optional arguments:
  -h, --help            show this help message and exit
  -output_dir OUTPUT_DIR
                        Output directory
  -input INPUT          Input data file must be in csv or txt or npz format
  -dataset DATASET      Dataset name
  --save-model, --no-save-model
                        whether to save model. (default: True)
  --binary-treatment, --no-binary-treatment
                        whether use binary treatment setting. (default: True)
  -z_dims Z_DIMS [Z_DIMS ...]
                        Latent dimensions of the four encoder outputs
                        e(V)_0~3.
  -lr LR                Learning rate for the optimizer (default: 0.0002).
  -alpha ALPHA          Coefficient for reconstruction loss (default: 1).
  -beta BETA            Coefficient for treatment and outcome MSE loss
                        (default: 1).
  -gamma GAMMA          Coefficient for gradient penalty loss (default: 10).
  -g_d_freq G_D_FREQ    Frequency for updating discriminators and generators
                        (default: 5).
  -g_units G_UNITS [G_UNITS ...]
                        Number of units for generator/decoder network
                        (default: [64,64,64,64,64]).
  -e_units E_UNITS [E_UNITS ...]
                        Number of units for encoder network (default:
                        [64,64,64,64,64]).
  -f_units F_UNITS [F_UNITS ...]
                        Number of units for f network (default: [64,32,8]).
  -h_units H_UNITS [H_UNITS ...]
                        Number of units for h network (default: [64,32,8]).
  -dz_units DZ_UNITS [DZ_UNITS ...]
                        Number of units for discriminator network in latent
                        space (default: [64,32,8]).
  -dv_units DV_UNITS [DV_UNITS ...]
                        Number of units for discriminator network in
                        confounder space (default: [64,32,8]).
  --use-z-rec, --no-use-z-rec
                        Use the reconstruction for latent features. (default:
                        True)
  --use-v-gan, --no-use-v-gan
                        Use the GAN distribution match for covariates.
                        (default: True)
  -batch_size BATCH_SIZE
                        Batch size (default: 32).
  -n_iter N_ITER        Number of iterations (default: 30000).
  -startoff STARTOFF    Iteration for starting evaluation (default: 0).
  -batches_per_eval BATCHES_PER_EVAL
                        Number of iterations per evaluation (default: 500).
  -save_format SAVE_FORMAT
                        Saving format (default: txt)
  --save_res, --no-save_res
                        Whether to save results during training. (default:
                        True)
  -seed SEED            Random seed for reproduction (default: 123).

The parameters are consistent with the Python API. Here, we use the demo data as an example.

[7]:
!causalEGM -input test/demo.csv -output_dir ./ -n_iter 100 -startoff 0 -batches_per_eval 50
2023-03-20 12:57:23.620713: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x57fa5c0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.
2023-03-20 12:57:23.620890: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_DEVICE_UNAVAILABLE: CUDA-capable device(s) is/are busy or unavailable
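
The invocation above can also be assembled from Python, which is convenient when launching several runs. The sketch below builds the argument list using only flags listed in the help text and does not start the process. Hiding GPUs through the CUDA_VISIBLE_DEVICES environment variable is a general CUDA/TensorFlow mechanism for running on CPU when the device is busy, as in the log above; that it resolves this particular failure is an assumption. build_command is a hypothetical helper, not part of CausalEGM.

```python
import os
import shlex


def build_command(input_file, output_dir, **options):
    """Hypothetical helper: build a causalEGM argv list from single-dash options."""
    argv = ["causalEGM", "-input", str(input_file), "-output_dir", str(output_dir)]
    for name, value in options.items():
        argv.append(f"-{name}")
        # List-valued options such as z_dims take several values after one flag.
        values = value if isinstance(value, (list, tuple)) else [value]
        argv.extend(str(item) for item in values)
    return argv


argv = build_command(
    "test/demo.csv", "./", n_iter=100, startoff=0, batches_per_eval=50
)

# Copy of the current environment with all GPUs hidden from CUDA.
cpu_env = dict(os.environ, CUDA_VISIBLE_DEVICES="-1")

print(shlex.join(argv))
# To launch the run: subprocess.run(argv, env=cpu_env, check=True)
```

Boolean switches such as --save-model or --no-use-v-gan take no value and would need to be appended to the list separately.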