Tutorial for Python Users

In this introductory tutorial, we go through the different steps of a CausalEGM workflow.

CausalEGM can be used through its Python API, its R API, or a single command-line call after installation.

First, install CausalEGM; please refer to the install page.

Use CausalEGM Python API

[1]:

import CausalEGM as cegm
print("Currently use version v%s of CausalEGM."%cegm.__version__)

Currently use version v0.4.0 of CausalEGM.

Configuring a CausalEGM model

Before creating a CausalEGM model, create a Python dict that holds the model's hyperparameters, including the dimensions of the latent features, the neural network architecture, etc.

The detailed hyperparameters are described as follows.

Config Parameters	Description
output_dir	Output directory to save the results during the model training. Default: “.”
dataset	Dataset name for indicating the input data. Default: “Mydata”
z_dims	Latent dimensions of the encoder outputs (e(V)_0~3). Default: [3,6,3,6]
v_dim	Dimension of covariates.
lr	Learning rate. Default: 0.0002
g_units	Number of units for decoder/generator network G. Default: [64,64,64,64,64].
e_units	Number of units for encoder network E. Default: [64,64,64,64,64].
f_units	Number of units for F network. Default: [64,32,8].
h_units	Number of units for H network. Default: [64,32,8].
dz_units	Number of units for discriminator network in latent space. Default: [64,32,8].
dv_units	Number of units for discriminator network in covariate space. Default: [64,32,8].
alpha	Coefficient for reconstruction loss. Default: 1.
beta	Coefficient for roundtrip loss. Default: 1.
gamma	Coefficient for gradient penalty loss. Default: 10.
g_d_freq	Frequency for updating discriminators and generators. Default: 5.
save_res	Whether to save results during the model training. Default: True.
save_model	Whether to save the model weights. Default: False.
binary_treatment	Whether to use binary treatment setting. Default: True.
use_z_rec	Use the reconstruction for latent features. Default: True.
use_v_gan	Use the GAN distribution match for covariates. Default: True.
x_min	Left bound for dose-response interval in continuous treatment settings. Default: 0.
x_max	Right bound for dose-response interval in continuous treatment settings. Default: 3.

Tips

Config parameters are necessary for creating a CausalEGM model. Here are some tips for configuring parameters.

z_dims has a noticeable impact on performance; please refer to src/configs for guidance.
If save_res is True, results during training will be saved at output_dir.
use_v_gan is recommended to be True under the binary treatment setting and False under the continuous treatment setting.
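
As a minimal sketch, the table above can be written out by hand as a plain Python dict. The values are the listed defaults; v_dim has no default, so 177 is only an assumed value taken from the Semi_acic config used in this tutorial, and x_min/x_max matter only in continuous treatment settings.

```python
# Hand-written hyperparameter dict using the defaults from the table above.
params = {
    "output_dir": ".",
    "dataset": "Mydata",
    "z_dims": [3, 6, 3, 6],
    "v_dim": 177,  # assumption: no default is listed, set this to your covariate dimension
    "lr": 0.0002,
    "g_units": [64, 64, 64, 64, 64],
    "e_units": [64, 64, 64, 64, 64],
    "f_units": [64, 32, 8],
    "h_units": [64, 32, 8],
    "dz_units": [64, 32, 8],
    "dv_units": [64, 32, 8],
    "alpha": 1,
    "beta": 1,
    "gamma": 10,
    "g_d_freq": 5,
    "save_res": True,
    "save_model": False,
    "binary_treatment": True,
    "use_z_rec": True,
    "use_v_gan": True,
    "x_min": 0,
    "x_max": 3,
}

# Following the tip above: turn off use_v_gan for continuous treatments.
if not params["binary_treatment"]:
    params["use_v_gan"] = False

print(params["z_dims"], params["use_v_gan"])
```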

Examples for loading config parameters

We provide many hyperparameter templates in the CausalEGM/src/configs folder for different datasets and settings.

Users can use yaml to load the hyperparameters as a python dict object easily.

[2]:

import yaml
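# Open the config inside a context manager so the file handle is closed after loading.
with open('../../src/configs/Semi_acic.yaml', 'r') as f:
    params = yaml.safe_load(f)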
print(params)

{'dataset': 'Semi_acic', 'output_dir': '.', 'v_dim': 177, 'z_dims': [3, 6, 3, 6], 'lr': 0.0002, 'alpha': 1, 'beta': 1, 'gamma': 10, 'g_d_freq': 5, 'g_units': [64, 64, 64, 64, 64], 'e_units': [64, 64, 64, 64, 64], 'f_units': [64, 32, 8], 'h_units': [64, 32, 8], 'dz_units': [64, 32, 8], 'dv_units': [64, 32, 8], 'save_res': True, 'save_model': False, 'binary_treatment': True, 'use_z_rec': True, 'use_v_gan': True}
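
Before passing a loaded dict to the model, it can help to run a quick sanity check. The sketch below uses a hypothetical helper, check_params, which is not part of CausalEGM; it only encodes constraints stated in the parameter table (v_dim has no default, z_dims lists four latent dimensions, and the *_units entries are lists of layer sizes).

```python
def check_params(params):
    """Return a list of human-readable problems found in a params dict.

    Hypothetical helper for illustration; not part of the CausalEGM API.
    """
    problems = []
    if "v_dim" not in params:
        problems.append("v_dim is missing (it has no default)")
    elif not isinstance(params["v_dim"], int) or params["v_dim"] <= 0:
        problems.append("v_dim should be a positive integer")

    # z_dims holds the dimensions of the four encoder outputs e(V)_0~3.
    z_dims = params.get("z_dims", [3, 6, 3, 6])
    if len(z_dims) != 4:
        problems.append("z_dims should list four latent dimensions")

    for key in ("g_units", "e_units", "f_units", "h_units", "dz_units", "dv_units"):
        units = params.get(key)
        if units is not None and not all(isinstance(u, int) and u > 0 for u in units):
            problems.append(f"{key} should be a list of positive integers")
    return problems


# A subset of the Semi_acic config printed above.
loaded = {"dataset": "Semi_acic", "v_dim": 177, "z_dims": [3, 6, 3, 6],
          "g_units": [64, 64, 64, 64, 64], "dv_units": [64, 32, 8]}
print(check_params(loaded))             # no problems reported
print(check_params({"z_dims": [3, 6]}))  # reports the missing v_dim and the short z_dims
```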

Initializing a CausalEGM model

Creating a CausalEGM model is straightforward once the hyperparameters (params) are prepared.

timestamp should be set to None if you want to train a model from scratch rather than load a pretrained model.

random_seed denotes the random seed used for reproducing the results.

[3]:

model = cegm.CausalEGM(params=params,random_seed=123)

Data preparation

Before training a CausalEGM model, we need to provide the data in a triplet, which contains treatment (x), potential outcome (y), and covariates (v).

Note that treatment (x) and potential outcome (y) should each be either a 1-dimensional array or an array with an additional axis of length one. Covariates (v) should be a two-dimensional array.
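
The shape requirements above can be illustrated with synthetic data. This is only a sketch: the sample size, covariate dimension, and random values are made up for illustration and do not come from any CausalEGM dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, v_dim = 100, 5  # hypothetical sample size and covariate dimension

x = rng.integers(0, 2, size=n)     # binary treatment, shape (n,)
y = rng.normal(size=n)             # outcome, shape (n,)
v = rng.normal(size=(n, v_dim))    # covariates, shape (n, v_dim)

# Optionally add a trailing axis of length one so x and y become column vectors.
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)

# All three arrays must describe the same n samples.
assert x.shape[0] == y.shape[0] == v.shape[0]
print(x.shape, y.shape, v.shape)   # (100, 1) (100, 1) (100, 5)
```

The resulting arrays can then be passed as the triplet list [x, y, v].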

Tips

There are three different ways to feed the training data to a CausalEGM model.

Loading an existing dataset from a data sampler.
Loading data from a python triplet list [x,y,v].
Loading data from a csv, txt, or npz file, where an example is provided at [path_to_CausalEGM]/test/demo.csv.

[4]:

#get the data from the ACIC 2018 competition dataset with a specified ufid.
x,y,v = cegm.Semi_acic_sampler(path='data/ACIC_2018',ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb').load_all()
print(x.shape,y.shape,v.shape)

(50000, 1) (50000, 1) (50000, 177)

Run CausalEGM model training

Once the data is ready, CausalEGM can be trained with the following parameters.

Training parameters	Description
data	List object containing the triplet data [X,Y,V]. Default: None.
data_file	Str object denoting the path to the input file (csv, txt, npz). Default: None.
sep	Str object denoting the delimiter for the input file. Default: \t.
header	Int object denoting row number(s) to use as the column names. Default: 0.
normalize	Bool object denoting whether to apply standard normalization to covariates. Default: False.
batch_size	Int object denoting the batch size in training. Default: 32.
n_iter	Int object denoting the training iterations. Default: 30000.
batches_per_eval	Int object denoting the number of iterations per evaluation. Default: 500.
batches_per_save	Int object denoting the number of iterations per save. Default: 10000.
startoff	Int object denoting the number of initial iterations to skip without saving or evaluation. Default: 0.
verbose	Bool object denoting whether to show the progress bar. Default: True.
save_format	Str object denoting the format (csv, txt, npz) to save the results. Default: txt.
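
As a sketch of file-based training, the parameters in the table can be collected into a keyword dict. The values are the listed defaults except for two assumptions: data_file points at the demo file mentioned in this tutorial, and sep="," assumes that file is comma-delimited. The final call is left commented out because it needs CausalEGM installed and a model already created.

```python
# Keyword arguments for training from a file, using the defaults from the table above.
train_kwargs = {
    "data_file": "test/demo.csv",  # assumption: path relative to the CausalEGM checkout
    "sep": ",",                    # assumption: the csv file is comma-delimited (default is a tab)
    "header": 0,
    "normalize": False,
    "batch_size": 32,
    "n_iter": 30000,
    "batches_per_eval": 500,
    "batches_per_save": 10000,
    "startoff": 0,
    "verbose": True,
    "save_format": "txt",
}

# With a model created as shown earlier, training would be started with:
# model.train(**train_kwargs)
print(sorted(train_kwargs))
```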

[5]:

model.train(data=[x,y,v],n_iter=100,save_format='npy',verbose=False)

The average treatment effect (ATE) is  -0.0064516705

We train CausalEGM for only 100 iterations for illustration purposes; n_iter is recommended to be 30000.

The results are saved under the directory given by the output_dir parameter, where causal_pre_at_[iter_number].[format] holds the individual treatment effect (ITE) in binary treatment settings and the average dose-response values in continuous treatment settings.

iter_number denotes the training iteration and format is determined by save_format, which can be csv, txt, or npz.
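
To locate the saved results afterwards, the file-name pattern above can be matched with the standard library. The helper list_result_files is hypothetical and not part of CausalEGM; it searches recursively because the exact subdirectory layout under output_dir is not described here.

```python
import re
from pathlib import Path

# Matches names such as causal_pre_at_100.txt and captures the iteration and format.
PATTERN = re.compile(r"causal_pre_at_(\d+)\.(\w+)$")


def list_result_files(output_dir):
    """Return (iteration, path) pairs for saved results, sorted by iteration.

    Hypothetical helper for illustration; not part of the CausalEGM API.
    """
    found = []
    for path in Path(output_dir).rglob("causal_pre_at_*"):
        match = PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])


for iteration, path in list_result_files("."):
    print(iteration, path)
```

The last entry of the returned list corresponds to the most recent saved iteration.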

Use CausalEGM by a command-line interface (CLI)

When CausalEGM is installed with pip, setuptools adds the console script to PATH and makes it available for general use. This has the advantage of being usable from non-Python scripts. This CLI takes a text file as input.

[6]:

!causalEGM -h

usage: causalEGM [-h] -output_dir OUTPUT_DIR -input INPUT [-dataset DATASET]
                 [--save-model | --no-save-model]
                 [--binary-treatment | --no-binary-treatment]
                 [-z_dims Z_DIMS [Z_DIMS ...]] [-lr LR] [-alpha ALPHA]
                 [-beta BETA] [-gamma GAMMA] [-g_d_freq G_D_FREQ]
                 [-g_units G_UNITS [G_UNITS ...]]
                 [-e_units E_UNITS [E_UNITS ...]]
                 [-f_units F_UNITS [F_UNITS ...]]
                 [-h_units H_UNITS [H_UNITS ...]]
                 [-dz_units DZ_UNITS [DZ_UNITS ...]]
                 [-dv_units DV_UNITS [DV_UNITS ...]]
                 [--use-z-rec | --no-use-z-rec] [--use-v-gan | --no-use-v-gan]
                 [-batch_size BATCH_SIZE] [-n_iter N_ITER]
                 [-startoff STARTOFF] [-batches_per_eval BATCHES_PER_EVAL]
                 [-save_format SAVE_FORMAT] [--save_res | --no-save_res]
                 [-seed SEED]

CausalEGM: A general causal inference framework by encoding generative
modeling - v0.4.0

optional arguments:
  -h, --help            show this help message and exit
  -output_dir OUTPUT_DIR
                        Output directory
  -input INPUT          Input data file must be in csv or txt or npz format
  -dataset DATASET      Dataset name
  --save-model, --no-save-model
                        whether to save model. (default: True)
  --binary-treatment, --no-binary-treatment
                        whether use binary treatment setting. (default: True)
  -z_dims Z_DIMS [Z_DIMS ...]
                        Latent dimensions of the four encoder outputs
                        e(V)_0~3.
  -lr LR                Learning rate for the optimizer (default: 0.0002).
  -alpha ALPHA          Coefficient for reconstruction loss (default: 1).
  -beta BETA            Coefficient for treatment and outcome MSE loss
                        (default: 1).
  -gamma GAMMA          Coefficient for gradient penalty loss (default: 10).
  -g_d_freq G_D_FREQ    Frequency for updating discriminators and generators
                        (default: 5).
  -g_units G_UNITS [G_UNITS ...]
                        Number of units for generator/decoder network
                        (default: [64,64,64,64,64]).
  -e_units E_UNITS [E_UNITS ...]
                        Number of units for encoder network (default:
                        [64,64,64,64,64]).
  -f_units F_UNITS [F_UNITS ...]
                        Number of units for f network (default: [64,32,8]).
  -h_units H_UNITS [H_UNITS ...]
                        Number of units for h network (default: [64,32,8]).
  -dz_units DZ_UNITS [DZ_UNITS ...]
                        Number of units for discriminator network in latent
                        space (default: [64,32,8]).
  -dv_units DV_UNITS [DV_UNITS ...]
                        Number of units for discriminator network in
                        confounder space (default: [64,32,8]).
  --use-z-rec, --no-use-z-rec
                        Use the reconstruction for latent features. (default:
                        True)
  --use-v-gan, --no-use-v-gan
                        Use the GAN distribution match for covariates.
                        (default: True)
  -batch_size BATCH_SIZE
                        Batch size (default: 32).
  -n_iter N_ITER        Number of iterations (default: 30000).
  -startoff STARTOFF    Iteration for starting evaluation (default: 0).
  -batches_per_eval BATCHES_PER_EVAL
                        Number of iterations per evaluation (default: 500).
  -save_format SAVE_FORMAT
                        Saving format (default: txt)
  --save_res, --no-save_res
                        Whether to save results during training. (default:
                        True)
  -seed SEED            Random seed for reproduction (default: 123).

The parameters are consistent with the Python API. Here, we use demo data as an example.

[7]:

!causalEGM -input test/demo.csv -output_dir ./ -n_iter 100 -startoff 0 -batches_per_eval 50

2023-03-20 12:57:23.620713: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x57fa5c0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.
2023-03-20 12:57:23.620890: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_DEVICE_UNAVAILABLE: CUDA-capable device(s) is/are busy or unavailable
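
The log above suggests the run aborted because the GPU was busy or unavailable. As a sketch of one possible workaround, assuming training on CPU is acceptable, the same command can be launched from Python with CUDA_VISIBLE_DEVICES set to -1, which hides GPUs from TensorFlow. The helper name build_cli_command is hypothetical; the flags are those listed in the help text above.

```python
import os
import shutil
import subprocess


def build_cli_command(input_file, output_dir, n_iter=100, startoff=0, batches_per_eval=50):
    """Build the causalEGM argument list (hypothetical helper, flags from the help text)."""
    return [
        "causalEGM",
        "-input", input_file,
        "-output_dir", output_dir,
        "-n_iter", str(n_iter),
        "-startoff", str(startoff),
        "-batches_per_eval", str(batches_per_eval),
    ]


cmd = build_cli_command("test/demo.csv", "./")

# Copy the current environment and hide all GPUs so TensorFlow falls back to CPU.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="-1")

# Only run when the console script and the demo file are actually available.
if shutil.which("causalEGM") is not None and os.path.exists("test/demo.csv"):
    subprocess.run(cmd, env=env, check=True)
else:
    print("Not running; the command would be:", " ".join(cmd))
```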