| Title: | Generation and Evaluation of Synthetic Tabular Datasets |
|---|---|
| Description: | Various tools developed as part of the Open-CESP (Centre de recherche en Epidémiologie et Santé des Populations) initiative to generate and evaluate synthetic datasets for statistical disclosure control. This includes tools to investigate the risk-utility tradeoff achievable with given synthesis methods, as well as statistical tools to estimate (conditional) probability distributions. The main eventual aim is to help researchers and statisticians disseminate open research data. |
| Authors: | Rémy Chapelle [aut, cre] (ORCID: <https://orcid.org/0009-0006-3088-0354>), Centre de recherche en Epidémiologie et Santé des Populations [cph] |
| Maintainer: | Rémy Chapelle <[email protected]> |
| License: | GPL-3 |
| Version: | 0.4.0 |
| Built: | 2026-06-02 08:58:36 UTC |
| Source: | https://github.com/cran/opencesp |
Various tools developed as part of the Open-CESP (Centre de recherche en Epidémiologie et Santé des Populations) initiative to generate and evaluate synthetic datasets for statistical disclosure control. This includes tools to investigate the risk-utility tradeoff achievable with given synthesis methods, as well as statistical tools to estimate (conditional) probability distributions. The main eventual aim is to help researchers and statisticians disseminate open research data.
Maintainer: Rémy Chapelle [email protected] (ORCID)
Authors:
Rémy Chapelle [email protected] (ORCID)
Other contributors:
Centre de recherche en Epidémiologie et Santé des Populations [copyright holder]
Useful links:
adaptive_matches_prop returns the proportion of rows in an original dataframe that can be found in a synthetic dataframe.
adaptive_matches_prop(orig, synth)adaptive_matches_prop(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
Matching is based on Gower's distance and is adaptive: an original row is counted as matched when its nearest synthetic row is closer than its nearest other original row.
The proportion.
data(iris) orig <- iris synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] adaptive_matches_prop(orig, synth)data(iris) orig <- iris synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] adaptive_matches_prop(orig, synth)
ASDED returns an empirical distributions-based metric, as proposed in doi:10.29012/jpc.v1i1.568
(see equation 5 in this paper).
ASDED(orig, synth)ASDED(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The ASDED metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) ASDED(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) ASDED(orig, synth)
avatarize returns a dataframe containing avatars of original observations, as described in doi:10.1038/s41746-023-00771-5.
avatarize(data, k, npc)avatarize(data, k, npc)
data |
A dataframe containing original data. |
k |
The number of nearest neighbors to consider. |
npc |
The number of dimensions to use for dimensionality reduction. |
Avatarization being a stochastic procedure, this function does not necessarily give the same result for successive calls with the same parameters. The distance metric used in this implementation of the Avatar method is the Euclidean distance.
The avatar dataframe.
data(iris) orig <- iris[sapply(iris, is.numeric)] synth <- avatarize(orig, k = 5, npc = 2) plot(orig$Sepal.Width, orig$Sepal.Length) points(synth$Sepal.Width, synth$Sepal.Length, col = "red")data(iris) orig <- iris[sapply(iris, is.numeric)] synth <- avatarize(orig, k = 5, npc = 2) plot(orig$Sepal.Width, orig$Sepal.Length) points(synth$Sepal.Width, synth$Sepal.Length, col = "red")
CCM_RS returns the CrCl-RS metric, as described in doi:10.1186/s12874-020-00977-1.
This metric is calculated as the average, over categorical variables, of the ratio of the classification accuracy
on the original hold-out data to the classification accuracy on the synthetic data. The classifiers used are CARTs.
The training and hold-out sets respectively correspond to the first and second half of the original data.
CCM_RS(orig, synth)CCM_RS(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The CrCl-RS metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CCM_RS(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CCM_RS(orig, synth)
CCM_SR returns the CrCl-SR metric, as described in doi:10.1186/s12874-020-00977-1.
CCM_SR(orig, synth)CCM_SR(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The CrCl-SR metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CCM_SR(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CCM_SR(orig, synth)
cor_F1 returns a corrected F1 membership disclosure metric, as described in
doi:10.1093/jamiaopen/ooac083.
cor_F1(orig_train, orig_ho, synth, m, h, N, t_prop = nrow(orig_train)/N)cor_F1(orig_train, orig_ho, synth, m, h, N, t_prop = nrow(orig_train)/N)
orig_train |
A subset of the original dataframe to use for training. |
orig_ho |
The hold-out subset of the original dataframe, with the same variables as |
synth |
A dataframe containing data synthesized from |
m |
The desired number of records in the attack dataset. |
h |
The Hamming distance threshold to use. |
N |
The size of the source population from which original data originate. |
t_prop |
The desired proportion of records in the attack dataset originating from |
The membership disclosure metric, or NA if the F1 score is undefined.
data(iris) orig <- iris ids_train <- sample(1:nrow(orig), round(nrow(orig) * 0.8)) orig_train <- orig[ids_train, ] orig_ho <- orig[-ids_train, ] # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig_train, function(x) { x[sample(length(x), nrow(orig_train), replace = TRUE)] })) cor_F1(orig_train, orig_ho, synth, m = 20, h = 2, N = 1000)data(iris) orig <- iris ids_train <- sample(1:nrow(orig), round(nrow(orig) * 0.8)) orig_train <- orig[ids_train, ] orig_ho <- orig[-ids_train, ] # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig_train, function(x) { x[sample(length(x), nrow(orig_train), replace = TRUE)] })) cor_F1(orig_train, orig_ho, synth, m = 20, h = 2, N = 1000)
CRM_RS returns a cross-regression metric, adapted from the CrCl-RS metric described
in doi:10.1186/s12874-020-00977-1. This metric is calculated as the average, over numeric variables,
of the ratio between the mean squared prediction error obtained on the original hold-out data and the mean
squared prediction error obtained on the synthetic data. Regressions are
performed using CART. The training and hold-out sets respectively correspond to the first and second
half of the original data.
CRM_RS(orig, synth)CRM_RS(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The cross-regression metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CRM_RS(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CRM_RS(orig, synth)
CRM_SR returns a cross-regression metric, adapted from the CrCl-SR metric described
in doi:10.1186/s12874-020-00977-1.
CRM_SR(orig, synth)CRM_SR(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The cross-regression metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CRM_SR(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) CRM_SR(orig, synth)
dcr returns the median distance between each synthetic unit and its nearest original neighbor. This type of metric was notably suggested
in doi:10.14778/3231751.3231757. Here, Gower's distance is used to account for mixed data types; see, for example, doi:10.48550/arXiv.2101.02481.
dcr(orig, synth, method = c("auto", "euclidean", "gower"))dcr(orig, synth, method = c("auto", "euclidean", "gower"))
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
method |
One of |
The median distance.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) dcr(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) dcr(orig, synth)
dep_order returns the dependency order of variables determined by hierarchical clustering.
dep_order( df, blocks = rep(1, ncol(df)), score_func, bf_first = TRUE, n_bf = 1, n_cores = 1 )dep_order( df, blocks = rep(1, ncol(df)), score_func, bf_first = TRUE, n_bf = 1, n_cores = 1 )
df |
A dataframe. |
blocks |
A numeric vector containing indices of independent variable blocks. |
score_func |
A custom scoring function, taking exactly two parameters: |
bf_first |
A boolean. If |
n_bf |
An integer, the number of clusters left unmerged by the hierarchical clustering procedure, and reordered by exhaustive search (in each block of |
n_cores |
An integer, the number of CPU cores to use. |
A numeric vector representing the order of dependency.
data(iris) iris <- iris[sapply(iris, is.numeric)] score_lin <- function(x, y) { if(is.null(x)) resids <- y-mean(y) else { df <- cbind.data.frame(y, x) fit <- lm(y ~ ., data = df) resids <- fit$residuals } sd_hat <- sd(resids) return(sum(dnorm(resids, sd = sd_hat, log = TRUE))) } dep_order(iris, score_func = score_lin)data(iris) iris <- iris[sapply(iris, is.numeric)] score_lin <- function(x, y) { if(is.null(x)) resids <- y-mean(y) else { df <- cbind.data.frame(y, x) fit <- lm(y ~ ., data = df) resids <- fit$residuals } sd_hat <- sd(resids) return(sum(dnorm(resids, sd = sd_hat, log = TRUE))) } dep_order(iris, score_func = score_lin)
GCAP returns the GCAP for a given row of a dataframe, as described in doi:10.48550/arXiv.2310.06571.
GCAP(row, data, keys, rad_keys, targets, rad_targets)GCAP(row, data, keys, rad_keys, targets, rad_targets)
row |
A row of a dataframe. |
data |
The dataframe. |
keys |
A vector containing the names of the variables of the dataframe to take as keys. |
rad_keys |
A vector containing radii, one for each numeric key. The order of the radii must match the order of the variables in the dataframe. |
targets |
A vector containing the names of the variables of the dataframe to take as targets. |
rad_targets |
A vector containing radii, one for each numeric target. The order of the radii must match the order of the variables in the dataframe. |
The computed probability.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] keys <- colnames(orig)[1:3] targets <- colnames(orig)[4:5] radii_k <- sapply(Filter(is.numeric, orig[, keys]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) radii_t <- sapply(Filter(is.numeric, orig[, targets]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) GCAP(target_row, synth, keys, radii_k, targets, radii_t)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] keys <- colnames(orig)[1:3] targets <- colnames(orig)[4:5] radii_k <- sapply(Filter(is.numeric, orig[, keys]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) radii_t <- sapply(Filter(is.numeric, orig[, targets]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) GCAP(target_row, synth, keys, radii_k, targets, radii_t)
get_prec returns the number of decimal places found in a numeric vector.
get_prec(vec, max = TRUE)get_prec(vec, max = TRUE)
vec |
A numeric vector. |
max |
A boolean. If |
Either a single integer or an integer vector.
data(iris) sapply(iris, get_prec)data(iris) sapply(iris, get_prec)
GTCAP computes the GTCAP for a synthetic version of a dataframe, as described in doi:10.48550/arXiv.2310.06571.
GTCAP(orig, synth, keys, rad_keys, targets, rad_targets, n_cores = 1)GTCAP(orig, synth, keys, rad_keys, targets, rad_targets, n_cores = 1)
orig |
The original dataframe. |
synth |
The synthetic dataframe. |
keys |
A vector containing names of variables of the dataframe to take as keys. |
rad_keys |
A vector containing radii, one for each numeric key. The order of the radii must match the order of the variables in the dataframe. |
targets |
A vector containing names of variables of the dataframe to take as targets. |
rad_targets |
A vector containing radii, one for each numeric target. The order of the radii must match the order of the variables in the dataframe. |
n_cores |
The number of logical processes to use for the computation. |
A list with the following elements:
The standardized mean GCAP for target synthetic rows.
A vector containing standardized GCAP values, one for each target synthetic row.
data(iris) orig <- iris[1:25, ] # we use only a sample of the data because GTCAP is computationally demanding # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] keys <- colnames(orig)[1:3] targets <- colnames(orig)[4:5] radii_k <- sapply(Filter(is.numeric, orig[, keys]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) radii_t <- sapply(Filter(is.numeric, orig[, targets]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) GTCAP(orig, synth, keys, radii_k, targets, radii_t)$meandata(iris) orig <- iris[1:25, ] # we use only a sample of the data because GTCAP is computationally demanding # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] keys <- colnames(orig)[1:3] targets <- colnames(orig)[4:5] radii_k <- sapply(Filter(is.numeric, orig[, keys]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) radii_t <- sapply(Filter(is.numeric, orig[, targets]), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) GTCAP(orig, synth, keys, radii_k, targets, radii_t)$mean
hellinger_distance returns an estimate of the Hellinger distance between original and synthetic data,
as suggested (for example) in doi:10.1109/ACCESS.2022.3144765. Missing values in the input vectors are ignored.
hellinger_distance(orig, synth)hellinger_distance(orig, synth)
orig |
A vector containing original data. |
synth |
A vector containing synthetic data, of the same type as |
The estimated Hellinger distance.
data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] hellinger_distance(orig$Species, synth$Species)data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] hellinger_distance(orig$Species, synth$Species)
impute_rf imputes missing values in a dataframe using the random-forest method from mice.
impute_rf(df, ...)impute_rf(df, ...)
df |
A dataframe. |
... |
Additional arguments passed to |
A dataframe with imputed values.
data(airquality) colSums(is.na(airquality)) airquality <- impute_rf(airquality) colSums(is.na(airquality))data(airquality) colSums(is.na(airquality)) airquality <- impute_rf(airquality) colSums(is.na(airquality))
ind_blocks finds approximately independent blocks of variables in a dataset by hierarchical clustering.
ind_blocks(df, crit = c("dim", "n.int"), max_size = ncol(df)/2, n_cores = 1)ind_blocks(df, crit = c("dim", "n.int"), max_size = ncol(df)/2, n_cores = 1)
df |
A dataframe. |
crit |
One of |
max_size |
The maximum size of the blocks returned. If |
n_cores |
An integer, the number of CPU cores to use. |
An integer vector representing membership of each variable to the found blocks.
data(iris) iris <- iris[sapply(iris, is.numeric)] blocks <- ind_blocks(iris, max_size = 2) score_lin <- function(x, y) { if(is.null(x)) resids <- y-mean(y) else { df <- cbind.data.frame(y, x) fit <- lm(y ~ ., data = df) resids <- fit$residuals } sd_hat <- sd(resids) return(sum(dnorm(resids, sd = sd_hat, log = TRUE))) } dep_order(iris, score_func = score_lin, blocks = blocks)data(iris) iris <- iris[sapply(iris, is.numeric)] blocks <- ind_blocks(iris, max_size = 2) score_lin <- function(x, y) { if(is.null(x)) resids <- y-mean(y) else { df <- cbind.data.frame(y, x) fit <- lm(y ~ ., data = df) resids <- fit$residuals } sd_hat <- sd(resids) return(sum(dnorm(resids, sd = sd_hat, log = TRUE))) } dep_order(iris, score_func = score_lin, blocks = blocks)
interval_overlap returns the overlap of confidence intervals constructed from original and synthetic data,
as described (for example) in doi:10.1111/rssa.12358. Intervals are constructed from t distributions
in case of numeric data, and with the Clopper–Pearson method for binomial intervals in case of categorical data.
interval_overlap(orig, synth, cat = NULL, conf.level = 0.95)interval_overlap(orig, synth, cat = NULL, conf.level = 0.95)
orig |
A vector containing original data. |
synth |
A vector containing synthetic data, of the same type as |
cat |
The category to consider to construct binomial intervals in case of categorical data. |
conf.level |
The desired level of confidence for the confidence intervals. |
The computed overlap.
data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] interval_overlap(orig$Sepal.Width, synth$Sepal.Width)data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] interval_overlap(orig$Sepal.Width, synth$Sepal.Width)
LCM returns a log-cluster metric, as described in doi:10.1186/s12874-020-00977-1.
This metric is based on cluster analysis as proposed initially in doi:10.29012/jpc.v1i1.568.
Here we perform cluster analysis based on Gower's distance to account for mixed data types (see for example
doi:10.48550/arXiv.2101.02481.
LCM(orig, synth, n_clusters)LCM(orig, synth, n_clusters)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
n_clusters |
The number of clusters to use. |
The log-cluster metric.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) LCM(orig, synth, 3)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) LCM(orig, synth, 3)
matches_prop returns the proportion of rows in an original dataframe that can be found in a synthetic dataframe.
matches_prop(orig, synth, method = c("exact", "gower"), thr = NULL)matches_prop(orig, synth, method = c("exact", "gower"), thr = NULL)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
method |
One of |
thr |
A numeric threshold used to decide whether two rows match when |
Duplicates within each dataframe are treated as a single record when method = "exact".
If thr = NULL, the threshold is chosen automatically as the average
nearest-neighbour distance between rows of orig. It represents the typical
spacing between observations in the original data.
The proportion.
data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] matches_prop(orig, synth)data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] matches_prop(orig, synth)
mean_hellinger returns the mean estimated Hellinger distance across all variables of original and synthetic datasets. This is close to doi:10.1093/jamia/ocaa249, but using the mean instead of the median.
mean_hellinger(orig, synth)mean_hellinger(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The average estimated Hellinger distance.
data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] mean_hellinger(orig, synth)data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] mean_hellinger(orig, synth)
outlier_coverage returns the outlier coverage, which is originally a privacy metric that compares two marginal distributions. Here its average value
across all variables in the original data is returned .
outlier_coverage(orig, synth)outlier_coverage(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The value of the metric.
SDMetrics documentation: OutlierCoverage
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) outlier_coverage(orig, synth)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) outlier_coverage(orig, synth)
outlier_learning_factor returns the outlier learning factor, a privacy metric which aims at quantifying the tendency of a synthesizer
to produce observations that lie in low-density regions of the original data. The metric is defined as the average distance between outliers
(as defined by the user) and their nearest synthetic neighbor.
outlier_learning_factor(orig, outlier_ids, synth)outlier_learning_factor(orig, outlier_ids, synth)
orig |
A dataframe containing original data. |
outlier_ids |
A numeric vector containing the indices of the outliers in |
synth |
A dataframe containing synthetic data, with the same variables as |
The distance used is Gower's distance.
The value of the metric.
data(iris) orig <- iris[sapply(iris, is.numeric)] # find the 5% most atypical observations by a nearest-neighbor rule D <- as.matrix(dist(orig, method = "euclidean")) score <- apply(D, 1, function(x) sort(x)[2]) idx_outliers <- order(score, decreasing = TRUE)[seq_len(ceiling(0.05 * nrow(orig)))] # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) outlier_learning_factor(orig, idx_outliers, synth)data(iris) orig <- iris[sapply(iris, is.numeric)] # find the 5% most atypical observations by a nearest-neighbor rule D <- as.matrix(dist(orig, method = "euclidean")) score <- apply(D, 1, function(x) sort(x)[2]) idx_outliers <- order(score, decreasing = TRUE)[seq_len(ceiling(0.05 * nrow(orig)))] # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) outlier_learning_factor(orig, idx_outliers, synth)
PCD_cat returns a metric adapted from the pairwise correlation difference suggested in
doi:10.1186/s12874-020-00977-1. Contrary to the original proposal, the matrices
from which the Frobenius norm is returned contain Cramer's V rather than Pearson
correlation coefficients. Consequently, the metric is suitable to nominal data, and is computed
on subsets of the input dataframes represented by all their categorical variables.
PCD_cat(orig, synth)PCD_cat(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The adapted pairwise correlation difference.
data(iris) # categorize numeric variables for the example iris[] <- lapply(iris, function(x) if(is.numeric(x)) cut(x, 2) else x) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] PCD_cat(orig, synth)data(iris) # categorize numeric variables for the example iris[] <- lapply(iris, function(x) if(is.numeric(x)) cut(x, 2) else x) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] PCD_cat(orig, synth)
PCD_num returns the pairwise correlation difference, as defined in doi:10.1186/s12874-020-00977-1.
The metric is computed on subsets of the input dataframes represented by all their numeric variables.
PCD_num(orig, synth)PCD_num(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
The pairwise correlation difference.
data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] PCD_num(orig, synth)data(iris) orig <- iris # simple synthesis by sampling with replacement synth <- iris[sample(1:nrow(orig), nrow(orig), replace = TRUE), ] PCD_num(orig, synth)
pgb fits a PGB model.
pgb(formula, data, pvalid = 0.2, valid_data = NULL, M = 50, ...)pgb(formula, data, pvalid = 0.2, valid_data = NULL, M = 50, ...)
formula |
A formula, with a response but no interaction terms. |
data |
A dataframe, the training data. |
pvalid |
A number between 0 and 1, the proportion of observations that are sampled to form a validation set for early stopping. |
valid_data |
A dataframe (with the same structure as the training data) used as a validation set for early stopping. |
M |
An integer, the number of target quantiles. |
... |
Additional arguments passed to |
Note that if valid_data is provided, then pvalid must be 0.
An object of class "pgb".
data(iris) train_data <- iris[1:100, ] valid_data <- iris[101:nrow(iris), ] # train with a specified validation set fit <- pgb(Sepal.Width ~ ., data = train_data, valid_data = valid_data, pvalid = 0) # randomly select 20% of observations to form a validation set fit <- pgb(Sepal.Width ~ ., data = iris, pvalid = 0.2) # train without early stopping fit <- pgb(Sepal.Width ~ ., data = iris, pvalid = 0, ntrees = 50)data(iris) train_data <- iris[1:100, ] valid_data <- iris[101:nrow(iris), ] # train with a specified validation set fit <- pgb(Sepal.Width ~ ., data = train_data, valid_data = valid_data, pvalid = 0) # randomly select 20% of observations to form a validation set fit <- pgb(Sepal.Width ~ ., data = iris, pvalid = 0.2) # train without early stopping fit <- pgb(Sepal.Width ~ ., data = iris, pvalid = 0, ntrees = 50)
pgb_control encapsulates the hyperparameters to be used for fitting PGB models.
pgb_control( ntrees = 10000, early_stopping_rounds = 100, maxdepth = 4, minbucket = 5, eta = 0.02, subsample = 0.5, maxbin = 256 )pgb_control( ntrees = 10000, early_stopping_rounds = 100, maxdepth = 4, minbucket = 5, eta = 0.02, subsample = 0.5, maxbin = 256 )
ntrees |
An integer, the maximum number of trees to use. |
early_stopping_rounds |
An integer, the maximum number of iterations allowed without improvement of the validation error. |
maxdepth |
An integer, the maximum depth of the trees. |
minbucket |
An integer, the minimum number of training observations in each leaf of the trees. |
eta |
A positive number, the learning rate of the procedure. |
subsample |
A number between 0 and 1, the proportion of training observations to sample at each iteration. |
maxbin |
An integer, the maximum number of discrete bins to bucket continuous features. In the current implementation, this is also the number of observations sampled at each iteration to perform the line search. |
A list with the options.
data(iris) controls <- pgb_control(ntrees = 1) # to fit a one-tree PGB model # the following two lines are equivalent fit <- do.call(pgb_cvh, c(list(formula = Sepal.Width ~ ., data = iris), controls)) fit <- pgb_cvh(Sepal.Width ~ ., data = iris, ntrees = 1)data(iris) controls <- pgb_control(ntrees = 1) # to fit a one-tree PGB model # the following two lines are equivalent fit <- do.call(pgb_cvh, c(list(formula = Sepal.Width ~ ., data = iris), controls)) fit <- pgb_cvh(Sepal.Width ~ ., data = iris, ntrees = 1)
pgb_cvh fits a PGB model, with its hyperparameters determined by cross-validation
pgb_cvh(formula, data, nfolds = 5, select_h = c("greedy", "cv", "none"), ...)pgb_cvh(formula, data, nfolds = 5, select_h = c("greedy", "cv", "none"), ...)
formula |
A formula, with a response but no interaction terms. |
data |
A dataframe, the training data. |
nfolds |
An integer, the number of folds for the cross-validation procedure. |
select_h |
One of |
... |
Additional arguments to |
An object of class "pgb".
data(iris) fit <- pgb_cvh(Sepal.Width ~ ., data = iris)data(iris) fit <- pgb_cvh(Sepal.Width ~ ., data = iris)
pMSE returns the propensity score mean-squared error obtained from an original and synthetic
dataset. For each variable, numeric values in the synthetic data are replaced with the nearest value
in the original data. This prevents tree-based classifiers (used to estimate the propensity scores) to
produce undesirable splits based only on the marginal distributions.
pMSE(orig, synth, method = c("cart", "gbm", "glm"), formula = syn ~ ., ...)pMSE(orig, synth, method = c("cart", "gbm", "glm"), formula = syn ~ ., ...)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
method |
One of |
formula |
A formula, used to specify the covariates and interaction terms to be used to estimate the propensity scores. The dependent variable is named |
... |
Additional arguments to be passed to the function used to estimate the propensity scores. |
The metric value.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) pMSE(orig, synth, method = "glm")data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) pMSE(orig, synth, method = "glm")
pMSE_cp returns the mean complexity parameter obtained by tuning decision trees (with rpart::rpart()) to estimate propensity scores
from a list of synthetic dataframes. Typically, this would be used to choose the complexity parameter for computing pMSE values.
pMSE_cp(orig, synth_l, xval = 25, ...)pMSE_cp(orig, synth_l, xval = 25, ...)
orig |
A dataframe containing original data. |
synth_l |
A list of dataframes containing synthetic data, each with the same variables as |
xval |
Number of cross-validations to perform. |
... |
Additional arguments to be passed to the |
The mean complexity value.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) cp <- pMSE_cp(orig, list(synth)) pMSE(orig, synth, method = "cart", cp = cp)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) cp <- pMSE_cp(orig, list(synth)) pMSE(orig, synth, method = "cart", cp = cp)
predict_cde_pgb predicts smooth conditional densities from a trained PGB model.
predict_cde_pgb(object, newdata, y_grid)predict_cde_pgb(object, newdata, y_grid)
object |
An object of class |
newdata |
A dataframe, structured like the training data (with or without the response variable). |
y_grid |
A vector of increasing numeric variables, the grid on which the densities will be evaluated. |
A matrix of predictions, where each line corresponds to an observation and each column to a density value on the provided grid.
data(iris) fit <- pgb_cvh(Sepal.Width ~ Sepal.Length, data = iris) y_grid <- seq(from = min(iris$Sepal.Width), to = max(iris$Sepal.Width), length.out = 200) ids_x <- c(25, 80, 125) cde <- predict_cde_pgb(fit, iris[ids_x, ], y_grid = y_grid) x0 <- iris[ids_x, "Sepal.Length"] s <- diff(range(iris$Sepal.Length)) / 8 / max(cde) plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal.Length", ylab = "Sepal.Width") for(i in seq_len(nrow(cde))) lines(x0[i] + s * cde[i, ], y_grid, lwd = 2)data(iris) fit <- pgb_cvh(Sepal.Width ~ Sepal.Length, data = iris) y_grid <- seq(from = min(iris$Sepal.Width), to = max(iris$Sepal.Width), length.out = 200) ids_x <- c(25, 80, 125) cde <- predict_cde_pgb(fit, iris[ids_x, ], y_grid = y_grid) x0 <- iris[ids_x, "Sepal.Length"] s <- diff(range(iris$Sepal.Length)) / 8 / max(cde) plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal.Length", ylab = "Sepal.Width") for(i in seq_len(nrow(cde))) lines(x0[i] + s * cde[i, ], y_grid, lwd = 2)
predict_cde_pgb_raw predicts histogram-like conditional densities (i.e. with no smoothing performed) from a trained PGB model.
predict_cde_pgb_raw(object, newdata, y_grid)predict_cde_pgb_raw(object, newdata, y_grid)
object |
An object of class |
newdata |
A dataframe, structured like the training data (with or without the response variable). |
y_grid |
A vector of increasing numeric variables, the grid on which the densities will be evaluated. |
A matrix of predictions, where each line corresponds to an observation and each column to a density value on the provided grid.
data(iris) fit <- pgb_cvh(Sepal.Width ~ Sepal.Length, data = iris, select_h = "none") y_grid <- seq(from = min(iris$Sepal.Width), to = max(iris$Sepal.Width), length.out = 200) ids_x <- c(25, 80, 125) cde <- predict_cde_pgb_raw(fit, iris[ids_x, ], y_grid = y_grid) x0 <- iris[ids_x, "Sepal.Length"] s <- diff(range(iris$Sepal.Length)) / 8 / max(cde) plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal.Length", ylab = "Sepal.Width") for(i in seq_len(nrow(cde))) lines(x0[i] + s * cde[i, ], y_grid, lwd = 2)data(iris) fit <- pgb_cvh(Sepal.Width ~ Sepal.Length, data = iris, select_h = "none") y_grid <- seq(from = min(iris$Sepal.Width), to = max(iris$Sepal.Width), length.out = 200) ids_x <- c(25, 80, 125) cde <- predict_cde_pgb_raw(fit, iris[ids_x, ], y_grid = y_grid) x0 <- iris[ids_x, "Sepal.Length"] s <- diff(range(iris$Sepal.Length)) / 8 / max(cde) plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal.Length", ylab = "Sepal.Width") for(i in seq_len(nrow(cde))) lines(x0[i] + s * cde[i, ], y_grid, lwd = 2)
predict.pgb predicts quantile values from a trained PGB model.
## S3 method for class 'pgb' predict(object, newdata, project = TRUE, ...)## S3 method for class 'pgb' predict(object, newdata, project = TRUE, ...)
object |
An object of class |
newdata |
A dataframe, structured like the training data (with or without the response variable). |
project |
A boolean, whether to correct for quantile crossings with an isotonic regression. |
... |
Further arguments passed to or from other methods. |
A matrix of predictions, where each line corresponds to an observation and each column to a quantile level.
data(iris) fit <- pgb_cvh(Sepal.Width ~ Petal.Length, data = iris) preds <- predict(fit, iris) plot(iris$Petal.Length, iris$Sepal.Width) matlines(iris$Petal.Length[order(iris$Petal.Length)], preds[order(iris$Petal.Length), ], type = "l", lty = 1)data(iris) fit <- pgb_cvh(Sepal.Width ~ Petal.Length, data = iris) preds <- predict(fit, iris) plot(iris$Petal.Length, iris$Sepal.Width) matlines(iris$Petal.Length[order(iris$Petal.Length)], preds[order(iris$Petal.Length), ], type = "l", lty = 1)
resample samples from a vector if it contains more than one element, and otherwise returns its only element.
resample(vec, ...)resample(vec, ...)
vec |
A vector. |
... |
Additional arguments passed to |
A sampled value or vector, depending on ....
vec <- 1:5 # sample and resample coincide when given a vector of length > 1 set.seed(1234) sample(vec, 1) set.seed(1234) resample(vec, 1) vec <- c(5) # with vectors of length 1, sample can return values that are not in the vector set.seed(1234) sample(vec, 1) # sample(c(5), 1) is equivalent to sample(1:5, 1) set.seed(1234) resample(vec, 1) # no problem with resamplevec <- 1:5 # sample and resample coincide when given a vector of length > 1 set.seed(1234) sample(vec, 1) set.seed(1234) resample(vec, 1) vec <- c(5) # with vectors of length 1, sample can return values that are not in the vector set.seed(1234) sample(vec, 1) # sample(c(5), 1) is equivalent to sample(1:5, 1) set.seed(1234) resample(vec, 1) # no problem with resample
Round synthetic numeric variables to the precision observed in the original data
round_synth(orig, synth)round_synth(orig, synth)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
A dataframe whose numeric columns are rounded with the precision observed in orig.
data(iris) orig <- iris[sapply(iris, is.numeric)] # synthesis by sampling from a multivariate Gaussian distribution # with parameters estimated from the data synth <- setNames( as.data.frame( sweep( matrix(rnorm(150 * ncol(orig)), 150) %*% chol(cov(orig) * (nrow(orig) - 1) / nrow(orig)), 2, colMeans(orig), "+" ) ), names(orig) ) sapply(synth, get_prec) synth <- round_synth(orig, synth) sapply(synth, get_prec)data(iris) orig <- iris[sapply(iris, is.numeric)] # synthesis by sampling from a multivariate Gaussian distribution # with parameters estimated from the data synth <- setNames( as.data.frame( sweep( matrix(rnorm(150 * ncol(orig)), 150) %*% chol(cov(orig) * (nrow(orig) - 1) / nrow(orig)), 2, colMeans(orig), "+" ) ), names(orig) ) sapply(synth, get_prec) synth <- round_synth(orig, synth) sapply(synth, get_prec)
tsAUC computes the AUC and associated p value from a two-sample test according to doi:10.1145/3411408.3411422.
tsAUC(orig, synth, ntreeTry = 50, ntree = 500, ...)tsAUC(orig, synth, ntreeTry = 50, ntree = 500, ...)
orig |
A dataframe containing original data. |
synth |
A dataframe containing synthetic data, with the same variables as |
ntreeTry |
An integer, the number of trees used at the tuning step. |
ntree |
An integer, the number of trees used at the training step. |
... |
options to be given to |
A list with the following elements:
The estimated area under the ROC curve.
The p value associated with a one-sided Wilcoxon test.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) tsAUC(orig, synth)$p.valuedata(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) tsAUC(orig, synth)$p.value
univ_att_prob computes the univariate correct attribution probability from given parameters, as described in doi:10.48550/arXiv.2310.06571.
univ_att_prob(row, data, rad)univ_att_prob(row, data, rad)
row |
The target row of the dataframe. |
data |
A dataframe. |
rad |
A vector containing radii, one for each numeric variable of the dataframe. The order of the radii must match the order of the variables in the dataframe. |
When the input dataframe contains only categorical variables, the value returned is the proportion of rows in the dataframe having the same combination of values as the input row. When the input dataframe also contains numeric variables, this value is weighted according to the proximity between these variables in the input row vs. each other row of the dataframe. The weights are given by radii passed as parameters.
The computed probability.
data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] radii <- sapply(Filter(is.numeric, orig), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) univ_att_prob(target_row, synth, radii)data(iris) orig <- iris # simple synthesis by sampling from the product of the marginal distributions synth <- as.data.frame(lapply(orig, function(x) x[sample(length(x), nrow(orig), replace = TRUE)])) target_row <- orig[1, ] radii <- sapply(Filter(is.numeric, orig), function(x) { d <- diff(sort(x)) mean(pmin(c(Inf, d), c(d, Inf))) }) univ_att_prob(target_row, synth, radii)