Tutorial for Python Users
In this introductory tutorial, we go through the different steps of a CausalEGM workflow.
Users can use CausalEGM by Python API or R API or a single command line after installation.
First of all, you need to install CausalEGM, please refer to the install page.
Use CausalEGM Python API
[1]:
import CausalEGM as cegm
print("Currently use version v%s of CausalEGM."%cegm.__version__)
Currently use version v0.4.0 of CausalEGM.
Configuring a CausalEGM model
Before creating a CausalEGM model, a python dict
object should be created for deploying the hyperparameters for a CausalEGM model, which include the dimensions for latent features, neural network architecture, etc.
The detailed hyperparameters are described as follows.
Config Parameters |
Description |
---|---|
output_dir |
Output directory to save the results during the model training. Default: “.” |
dataset |
Dataset name for indicating the input data. Default: “Mydata” |
z_dims |
Latent dimensions of the encoder outputs (e(V)_0~3). Default: [3,6,3,6] |
v_dim |
Dimension of covariates. |
lr |
Learning rate. Default: 0.0002 |
g_units |
Number of units for decoder/generator network G. Default: [64,64,64,64,64]. |
e_units |
Number of units for encoder network E. Default: [64,64,64,64,64]. |
f_units |
Number of units for F network. Default: [64,32,8]. |
h_units |
Number of units for H network. Default: [64,32,8]. |
dz_units |
Number of units for discriminator network in latent space. Default: [64,32,8]. |
dz_units |
Number of units for discriminator network in covariate space. Default: [64,32,8]. |
alpha |
Coefficient for reconstruction loss. Default: 1. |
beta |
Coefficient for roundtrip loss. Default: 1. |
gamma |
Coefficient for gradient penalty loss. Default: 10. |
g_d_freq |
Frequency for updating discriminators and generators. Default: 5. |
save_res |
Whether to save results during the model training. Default: True. |
save_model |
Whether to save the model wegihts. Default: False. |
binary_treatment |
Whether to use binary treatment setting. Default: True. |
use_z_rec |
Use the reconstruction for latent features. Default: True. |
use_v_gan |
Use the GAN distribution match for covariates. Default: True. |
x_min |
Left bound for dose-response interval in continuous treatment settings. Default: 0. |
x_max |
Right bound for dose-response interval in continuous treatment settings. Default: 3. |
Tips
Config parameters are necessary for creating a CausalEGM model. Here are some tips for configuring parameters.
z_dims has a noticeable impact on the performance, please refer to src/configs for guidance.
If save_res is True, results during training will be saved at output_dir
use_v_gan is recommended to be True under binary treatment setting and False under continous treatment setting.
Examples for loading config parameters
We provide many templates of the hyperparameters in CausalEGM/src/configs
folder for different datasets/settings.
Users can use yaml
to load the hyperparameters as a python dict
object easily.
[2]:
import yaml
params = yaml.safe_load(open('../../src/configs/Semi_acic.yaml', 'r'))
print(params)
{'dataset': 'Semi_acic', 'output_dir': '.', 'v_dim': 177, 'z_dims': [3, 6, 3, 6], 'lr': 0.0002, 'alpha': 1, 'beta': 1, 'gamma': 10, 'g_d_freq': 5, 'g_units': [64, 64, 64, 64, 64], 'e_units': [64, 64, 64, 64, 64], 'f_units': [64, 32, 8], 'h_units': [64, 32, 8], 'dz_units': [64, 32, 8], 'dv_units': [64, 32, 8], 'save_res': True, 'save_model': False, 'binary_treatment': True, 'use_z_rec': True, 'use_v_gan': True}
Initilizing a CausalEGM model
It is super easy to create a CausalEGM model when the hyperparameters (params
) are prepared.
timestamp
should set to be None if you want to train a model from scratch rather than loading a pretrained model.
random_seed
denotes the random seed used for reproducing the results.
[3]:
model = cegm.CausalEGM(params=params,random_seed=123)
Data preparation
Before training a CausalEGM model, we need to provide the data in a triplet, which contains treatment (x
), potential outcome (y
), and covariates (v
).
Note that treatment (x
) and potential outcome (y
) should be either 1-dimensional array or with an additional axes of length one. Covariates should be a two-dimensional array.
Tips
There are three different ways to feed the training data to a CausalEGM model.
Loading an existing dataset from a data sampler.
Loading data from a python triplet list [x,y,v].
Loading data from a csv, txt, or npz file, where an example is provided at
[path_to_CausalEGM]/test/demo.csv
.
[4]:
#get the data from the ACIC 2018 competition dataset with a specified ufid.
x,y,v = cegm.Semi_acic_sampler(path='data/ACIC_2018',ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb').load_all()
print(x.shape,y.shape,v.shape)
(50000, 1) (50000, 1) (50000, 177)
Run CausalEGM model training
Once data is ready, CausalEGM can be trained with the following parameters
Training parameters |
Description |
---|---|
data |
List object containing the triplet data [X,Y,V]. Default: None. |
data_file |
Str object denoting the path to the input file (csv, txt, npz). Default: None. |
sep |
Str object denoting the delimiter for the input file. *Default: \t*. |
header |
Int object denoting row number(s) to use as the column names. Default: 0. |
normalize |
Bool object denoting whether apply standard normalization to covariates. Default: False. |
batch_size |
Int object denoting the batch size in training. Default: 32. |
n_iter |
Int object denoting the training iterations. Default: 30000. |
batches_per_eval |
Int object denoting the number of iterations per evaluation. Default: 500. |
batches_per_save |
Int object denoting the number of iterations per save. Default: 10000. |
startoff |
Int object denoting the beginning iterations to jump without save and evaluation. Defalt: 0. |
verbose |
Bool object denoting whether showing the progress bar. Default: True. |
save_format |
Str object denoting the format (csv, txt, npz) to save the results. Default: txt. |
[5]:
model.train(data=[x,y,v],n_iter=100,save_format='npy',verbose=False)
The average treatment effect (ATE) is -0.0064516705
We train a CausalEGM for 100 iterations for illustration purpose, n_iter
is recommended to be 30000.
The results are saved based on the output_dir
parameter where causal_pre_at_[iter_number].[format]
denotes the individual treatment effect (ITE) in binary treatment settings and average dose-response values in continuous treatment settings.
iter_number
denotes the training iteraction and format
is determined by save_format
, which can be csv
,txt
, or npz
.
Use CausalEGM by a command-line interface (CLI)
When installing the CausalEGM by pip
, setuptools will add the console script to PATH and make it available for general use. This has advantage of being generalizeable to non-python scripts! This CLI takes a text file as input.
[6]:
!causalEGM -h
usage: causalEGM [-h] -output_dir OUTPUT_DIR -input INPUT [-dataset DATASET]
[--save-model | --no-save-model]
[--binary-treatment | --no-binary-treatment]
[-z_dims Z_DIMS [Z_DIMS ...]] [-lr LR] [-alpha ALPHA]
[-beta BETA] [-gamma GAMMA] [-g_d_freq G_D_FREQ]
[-g_units G_UNITS [G_UNITS ...]]
[-e_units E_UNITS [E_UNITS ...]]
[-f_units F_UNITS [F_UNITS ...]]
[-h_units H_UNITS [H_UNITS ...]]
[-dz_units DZ_UNITS [DZ_UNITS ...]]
[-dv_units DV_UNITS [DV_UNITS ...]]
[--use-z-rec | --no-use-z-rec] [--use-v-gan | --no-use-v-gan]
[-batch_size BATCH_SIZE] [-n_iter N_ITER]
[-startoff STARTOFF] [-batches_per_eval BATCHES_PER_EVAL]
[-save_format SAVE_FORMAT] [--save_res | --no-save_res]
[-seed SEED]
CausalEGM: A general causal inference framework by encoding generative
modeling - v0.4.0
optional arguments:
-h, --help show this help message and exit
-output_dir OUTPUT_DIR
Output directory
-input INPUT Input data file must be in csv or txt or npz format
-dataset DATASET Dataset name
--save-model, --no-save-model
whether to save model. (default: True)
--binary-treatment, --no-binary-treatment
whether use binary treatment setting. (default: True)
-z_dims Z_DIMS [Z_DIMS ...]
Latent dimensions of the four encoder outputs
e(V)_0~3.
-lr LR Learning rate for the optimizer (default: 0.0002).
-alpha ALPHA Coefficient for reconstruction loss (default: 1).
-beta BETA Coefficient for treatment and outcome MSE loss
(default: 1).
-gamma GAMMA Coefficient for gradient penalty loss (default: 10).
-g_d_freq G_D_FREQ Frequency for updating discriminators and generators
(default: 5).
-g_units G_UNITS [G_UNITS ...]
Number of units for generator/decoder network
(default: [64,64,64,64,64]).
-e_units E_UNITS [E_UNITS ...]
Number of units for encoder network (default:
[64,64,64,64,64]).
-f_units F_UNITS [F_UNITS ...]
Number of units for f network (default: [64,32,8]).
-h_units H_UNITS [H_UNITS ...]
Number of units for h network (default: [64,32,8]).
-dz_units DZ_UNITS [DZ_UNITS ...]
Number of units for discriminator network in latent
space (default: [64,32,8]).
-dv_units DV_UNITS [DV_UNITS ...]
Number of units for discriminator network in
confounder space (default: [64,32,8]).
--use-z-rec, --no-use-z-rec
Use the reconstruction for latent features. (default:
True)
--use-v-gan, --no-use-v-gan
Use the GAN distribution match for covariates.
(default: True)
-batch_size BATCH_SIZE
Batch size (default: 32).
-n_iter N_ITER Number of iterations (default: 30000).
-startoff STARTOFF Iteration for starting evaluation (default: 0).
-batches_per_eval BATCHES_PER_EVAL
Number of iterations per evaluation (default: 500).
-save_format SAVE_FORMAT
Saving format (default: txt)
--save_res, --no-save_res
Whether to save results during training. (default:
True)
-seed SEED Random seed for reproduction (default: 123).
The parameters are consistent with the Python APIs
. Here, we use a demo data for an example!
[7]:
!causalEGM -input test/demo.csv -output_dir ./ -n_iter 100 -startoff 0 -batches_per_eval 50
2023-03-20 12:57:23.620713: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x57fa5c0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.
2023-03-20 12:57:23.620890: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_DEVICE_UNAVAILABLE: CUDA-capable device(s) is/are busy or unavailable