Title: | Goodness-of-Fit Tests using Kernelized Stein Discrepancy |
---|---|
Description: | An adaptation of Kernelized Stein Discrepancy, this package provides a goodness-of-fit test of whether a given i.i.d. sample is drawn from a given distribution. It works for any distribution once its score function (the derivative of log-density) can be provided. This method is based on "A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation" by Liu, Lee, and Jordan, available at <arXiv:1602.03253>. |
Authors: | Min Hyung Kang [aut, cre], Qiang Liu [aut] |
Maintainer: | Min Hyung Kang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-03-08 03:45:39 UTC |
Source: | https://github.com/minhyung-kang/ksd |
Tests 1-dimensional Gaussian Mixture Models.
demo_gmm()
demo_gmm()
Tests multidimensional Gaussian Mixture Models.
demo_gmm_multi()
demo_gmm_multi()
We fit a Gaussian Mixture Model for a given dataset (Fisher's Iris), and we compute the KSD P-value on the hold-out test dataset. User may tune the parameters and observe the change in results. Reports average of p-values obtained during each k-fold. It also plots the contour for each k-fold iteration if only 2 dimensions of data are used. If a vector is specified for nClust, the code tries each element as the number of clusters and reports the optimal parameter by choosing one with highest p-value.
demo_iris(cols = c(1, 2), nClust = 3, kfold = 5)
demo_iris(cols = c(1, 2), nClust = 3, kfold = 5)
cols |
: Columns of iris data set to use. If 2 dimensions, plots the contour for each k-fold. |
nClust |
: Number of clusters want to estimate with If vector, use each element as number of clusters and reports the optimal number. |
kfold |
: Number of k to use for k-fold |
We generate a standard normal distribution, and add varying gaussian noise to this dataset and see the change in pvalues.
demo_normal_performance()
demo_normal_performance()
We generate a gamma distribution with given parameters, and add gaussian noise to this dataset. We then compute the score of each dataset for the original true distribution.
demo_simple_gamma( trueshape = 10, truescale = 3, noisemu = 5, noisesd = 2, n = 100 )
demo_simple_gamma( trueshape = 10, truescale = 3, noisemu = 5, noisesd = 2, n = 100 )
trueshape |
shape of true gamma distribution |
truescale |
scale of true gamma distribution |
noisemu |
mean of gaussian noise to add |
noisesd |
standard deviation of gaussian noise to add |
n |
number of samples to generate |
We generate a gaussian distribution with given parameters, and add noise to this dataset. We then compute the score of each dataset for the original true distribution.
demo_simple_gaussian(truemu = 5, truesd = 1, noisemu = 0, noisesd = 2, n = 100)
demo_simple_gaussian(truemu = 5, truesd = 1, noisemu = 0, noisesd = 2, n = 100)
truemu |
mean of true distribution |
truesd |
standard deviation of true distribution |
noisemu |
mean of gaussian noise to add |
noisesd |
standard deviation of gaussian noise to add |
n |
number of samples to generate |
Returns a Gaussian Mixture Model
gmm(nComp = NULL, mu = NULL, sigma = NULL, weights = NULL, d = NULL)
gmm(nComp = NULL, mu = NULL, sigma = NULL, weights = NULL, d = NULL)
nComp |
(scalar) : number of components |
mu |
(d by k): mean of each component |
sigma |
(d by d by k): covariance of each component |
weights |
(1 by k) : mixing weight of each proportion (optional) |
d |
: number of dimensions of vector (optional) |
model : A Gaussian Mixture Model generated from the given parameters
# Default 1-d gaussian mixture model model <- gmm() # 1-d Gaussian mixture model with 3 components model <- gmm(nComp = 3) # 3-d Gaussian mixture model with 3 components, with specified mu,sigma and weights mu <- matrix(c(1,2,3,2,3,4,5,6,7),ncol=3) sigma <- array(diag(3),c(3,3,3)) model <- gmm(nComp = 3, mu = mu, sigma=sigma, weights = c(0.2,0.4,0.4), d = 3)
# Default 1-d gaussian mixture model model <- gmm() # 1-d Gaussian mixture model with 3 components model <- gmm(nComp = 3) # 3-d Gaussian mixture model with 3 components, with specified mu,sigma and weights mu <- matrix(c(1,2,3,2,3,4,5,6,7),ncol=3) sigma <- array(diag(3),c(3,3,3)) model <- gmm(nComp = 3, mu = mu, sigma=sigma, weights = c(0.2,0.4,0.4), d = 3)
Estimate kernelized Stein discrepancy (KSD) using U-statistics,
and use bootstrap to test H0: is drawn from
(via KSD=0).
KSD(x, score_function, kernel = "rbf", width = -1, nboot = 1000)
KSD(x, score_function, kernel = "rbf", width = -1, nboot = 1000)
x |
Sample of size Num_Instance x Num_Dimension |
score_function |
( |
kernel |
Type of kernel (default = 'rbf') |
width |
Bandwidth of the kernel (when width = -1 or 'median', set it to be the median distance between data points) |
nboot |
Bootstrap sample size |
A list which includes the following variables :
"ksd" : Estimated Kernelized Stein Discrepancy (KSD)
"p" : p-Value for rejecting the null hypothesis that ksd = 0
"bootstrapSamples" : the bootstrap sample
"info": other information, including : bandwidth, M, nboot, ksd_V
# Pass in a dataset generated by Gaussian distribution, # use pryr package to pass in score function model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function) # Pass in a dataset generated by Gaussian distribution, # pass in computed score rather than score function model <- gmm() X <- rgmm(model, n=100) score_function = scorefunctiongmm(model=model, X=X) result <- KSD(X,score_function=score_function) # Pass in a dataset generated by Gaussian distribution, # pass in computed score rather than score function # Use median_heuristic by specifying width to be -2.0 model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function, 'rbf',-2.0) # Pass in a dataset generated by specific Gaussian distribution, # pass in computed score rather than score function # Use median_heuristic by specifying width to be -2.0 model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function, 'rbf',-2.0)
# Pass in a dataset generated by Gaussian distribution, # use pryr package to pass in score function model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function) # Pass in a dataset generated by Gaussian distribution, # pass in computed score rather than score function model <- gmm() X <- rgmm(model, n=100) score_function = scorefunctiongmm(model=model, X=X) result <- KSD(X,score_function=score_function) # Pass in a dataset generated by Gaussian distribution, # pass in computed score rather than score function # Use median_heuristic by specifying width to be -2.0 model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function, 'rbf',-2.0) # Pass in a dataset generated by specific Gaussian distribution, # pass in computed score rather than score function # Use median_heuristic by specifying width to be -2.0 model <- gmm() X <- rgmm(model, n=100) score_function = pryr::partial(scorefunctiongmm, model=model) result <- KSD(X,score_function=score_function, 'rbf',-2.0)
Calculates the likelihood for a given dataset for a GMM
likelihoodgmm(model = NULL, X = NULL)
likelihoodgmm(model = NULL, X = NULL)
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
P (n by k) : The likelihood of each dataset belonging to each of the k component
# compute likelihood for a default 1-d gaussian mixture model # and dataset generated from it model <- gmm() X <- rgmm(model) p <- likelihoodgmm(model=model, X=X)
# compute likelihood for a default 1-d gaussian mixture model # and dataset generated from it model <- gmm() X <- rgmm(model) p <- likelihoodgmm(model=model, X=X)
Returns a perturbed model of given GMM
perturbgmm(model = NULL)
perturbgmm(model = NULL)
model |
: The base Gaussian Mixture Model |
perturbedModel : Perturbed model with added noise to the supplied GMM
#Add noise to default 1-d gaussian mixture model model <- gmm() noisymodel <- perturbgmm(model)
#Add noise to default 1-d gaussian mixture model model <- gmm() noisymodel <- perturbgmm(model)
Plots histogram for 1-d GMM given the dataset
plotgmm(data, mu = NULL)
plotgmm(data, mu = NULL)
data |
(n by 1): The dataset of interest, where n is the number of samples. |
mu |
: True mean of the GMM (optional) |
# Plot pdf histogram for a given dataset model <- gmm() X <- rgmm(model) plotgmm(data=X) # Plot pdf histogram for a given dataset, with lines that indicate the mean model <- gmm() mu <- model$mu X <- rgmm(model) plotgmm(data=X, mu=mu)
# Plot pdf histogram for a given dataset model <- gmm() X <- rgmm(model) plotgmm(data=X) # Plot pdf histogram for a given dataset, with lines that indicate the mean model <- gmm() mu <- model$mu X <- rgmm(model) plotgmm(data=X, mu=mu)
Calculates the posterior probability for a given dataset for a GMM
posteriorgmm(model = NULL, X = NULL)
posteriorgmm(model = NULL, X = NULL)
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
P (n by k) : The posterior probabilty of each dataset belonging to each of the k component
# compute posterior probability for a default 1-d gaussian mixture model # and dataset generated from it model <- gmm() X <- rgmm(model) p <- posteriorgmm(model=model, X=X)
# compute posterior probability for a default 1-d gaussian mixture model # and dataset generated from it model <- gmm() X <- rgmm(model) p <- posteriorgmm(model=model, X=X)
Generates dataset from Gaussian Mixture Model
rgmm(model = NULL, n = 100)
rgmm(model = NULL, n = 100)
model |
: Gaussian Mixture Model defined by gmm() |
n |
: number of samples desired |
data (n by d): Random dataset generated from given the Gaussian Mixture Model
Requires library mvtnorm
#Generate 100 samples from default gaussian mixture model model <- gmm() X <- rgmm(model) #Generate 300 samples from 3-d gaussian mixture model model <- gmm(d=3) X <- rgmm(model,n=300)
#Generate 100 samples from default gaussian mixture model model <- gmm() X <- rgmm(model) #Generate 300 samples from 3-d gaussian mixture model model <- gmm(d=3) X <- rgmm(model,n=300)
Score function for given GMM : calculates score function dlogp(x)/dx for a given Gaussian Mixture Model
scorefunctiongmm(model = NULL, X = NULL)
scorefunctiongmm(model = NULL, X = NULL)
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
y : The score computed by the given function
# Compute score for a given gaussianmixture model and dataset model <- gmm() X <- rgmm(model) score <- scorefunctiongmm(model=model, X=X)
# Compute score for a given gaussianmixture model and dataset model <- gmm() X <- rgmm(model) score <- scorefunctiongmm(model=model, X=X)