Back to the main page.

This page illustrates the data analysis from Section 6. The output of the code below corresponds to the first column of Table 5. We set seed to the value 123456 to get the same numerical values as presented in the paper.

set.seed(123456) 

Phenome data

Consider the phoneme data from library fds that were formed by selecting five phonemes for classification based on digitized speech. The paper (Hlávka, Hlubinka, and Koňasová 2022) investigated data corresponding to phonemes aa and ao, and observed statistically significant differences in the distribution of the two functional random samples (using a test comparing empirical characteristic functions) in the first three of five segments (with sample size 15 in each sample).

In this example we use the first segment and data with sample size \(n=15\) in each sample, i.e. the same data as in the first column of Table 5 of the paper.

library(fds)
data(aa)
data(ao)  
n=15
X1=t(aa$y[1:30,1:n])
X2=t(ao$y[1:30,1:n])

We create a combined sample dat and also combined centered sample dcn, which will be used for computation of covariances.

## combined sample:
dat = rbind(X1,X2) 
## combined centered sample:
dcn =rbind(scale(X1,center=TRUE,scale=FALSE),scale(X2,center=TRUE,scale=FALSE)) 

## group indicator:
grp = c(rep(1,n),rep(2,n))

The aim is to test the null hypothesis \[ H_0: \text{ the distribution of the two samples is the same.} \] It will be tested via simplified partial hypotheses \[ \widetilde{H}_{0}^M: \mu_1=\mu_2, \quad \widetilde{H}_0^C: C_1=C_2, \] where \(\mu_i\), \(i=1,2\), are mean functions and \(C_i\), \(i=1,2\), are the covariance operators.

Partial permutation tests

The hypothesis \(\widetilde{H}_{0}^M\) of equality of mean functions can be tested using a test statistic Fmax of (Górecki and Smaga 2019). The hypothesis \(\widetilde{H}_0^C\) of equality of covariance operators is tested by the test statistic SQ based on the square-root distance, (Cabassi et al. 2017). Significance of both is evaluated by permutation tests.

The Fmax test statistic can be computed using function fanova.tests from library fdANOVA:

library(fdANOVA)

fanfd=fanova.tests(t(dat),group.label = grp,parallel = FALSE,test = "Fmaxb",
                   params = list(paramFmaxb=1))$Fmaxb$statFmax 

The SQ statistic used to be part of package fdcov, but this library is, unfortunately, no longer available on cran. Hence, the necessary functions for computation of SQ are provided in code fdcov.R:

source("fdcov.R")

# test statistic SQ:
covsq=distCov(cov(dcn[grp==1,],use='pairwise'),cov(dcn[grp==2,],use='pairwise'),"sq")

The permutation p-values for Fmax and SQ are computed from \(B=1999\) permutation replicas. First, we calculate \(B\) replicas of each test statistic:

B=1999


fanfd.b=rep(NA,B)
covsq.b=rep(NA,B)
for (b in 1:B) {
    ## the permutation
    s=sample.int(2*n)
    ## reordered data set
    dat.b=dat[s,]
    dcn.b=dcn[s,]
    ## statistics for permuted data set
    covsq.b[b]=distCov(cov(dcn.b[grp==1,],use='pairwise'),cov(dcn.b[grp==2,],use='pairwise'),"sq")
    fanfd.b[b]=fanova.tests(t(dat.b),group.label = grp,parallel = FALSE,test = "Fmaxb",params = list(paramFmaxb=1))$Fmaxb$statFmax 
}

Since both test statistics Fmax and SQ take only positive values, the resulting p-values are obtained using the usual formula \[\tag{1} p_{j}=(B+1)^{-1} \Bigl\{1+\sum_{b=1}^B \boldsymbol{1}(T_{j,b}\geq T_{j,0})\Bigr\}, \] where \(T_{j,0}\) is the value of the test statistic computed from the data (\(j=1\) for SQ and \(j=2\) for Fmax), and \(T_{j,b}\), \(b=1,\dots,B\), are the permutation replicas. These numbers are the same as presented in the first two rows of the first column of Table 5 of the paper:

pval.cov=(1+sum(covsq.b>covsq))/(B+1)

pval.fmx=(1+sum(fanfd.b>fanfd))/(B+1)

print(paste("p-value (Fmax, equality of means):",pval.fmx))
## [1] "p-value (Fmax, equality of means): 0.443"
print(paste("p-value (SQ, equality of covariance):",pval.cov))
## [1] "p-value (SQ, equality of covariance): 0.039"

Permutation tests based on combining functions

The joint null hypothesis \[ \widetilde{H}_0^{CM}: \mu_1=\mu_2 \ \& \ C_1=C_2 \] of equality of the mean functions and the covariance operator can be tested using the multivariate permutation test based on nonparametric combining functions. We calculate first the two vectors of partial p-values (based on \(B=1999\) permutations), as described in Section 2.1 of the paper, using formula \[ p_{j,b}=\frac{1}{B+1}\left\{1+\sum_{s=1}^B \boldsymbol{1}(T_{j,s}\geq T_{j,b})\right\}. \] Using the fact that \(\sum_{s=1}^B \boldsymbol{1}(T_{j,s}\geq T_{j,b})\) is equal to \(B-R_b+1\), where \(R_b\) is the rank of \(T_{j,b}\) within sample \(T_{j,1},\dots,T_{j,B}\), the formula can be expressed as follows:

pval.cov.p=(2+B-rank((abs(covsq.b))))/(B+1) 
pval.fmx.p=(2+B-rank((abs(fanfd.b))))/(B+1)

pval1=pval.cov
pval2=pval.fmx
pval1.p=pval.cov.p
pval2.p=pval.fmx.p

For each permutation step \(b\in\{1,\dots,B\}\), we have a vector of partial p-values \(\mathbf{p}_b=(p_{1,b},p_{2,b})^\top\), where \(p_{1,b}\) corresponds to SQ and \(p_{2b}\) corresponds to Fmax. Similarly, the vector of partial p-values for the data is \(\mathbf{p}=(p_1,p_2)^\top\), where \(p_1\) and \(p_2\) are given in (1). The vector pval1.p contains values of \(p_{1,b}\), \(b=1,\dots,B\), while values of \(p_{2,b}\) are stored in vector pval2.p.

We apply the Tippett, Fisher, and Liptak combining functions. Recall that the Tippett combining function for general dimension \(d\) is \[ f_T(\boldsymbol{p})=\max_{1\leq j\leq d}(1-p_{j}), \] the Fisher omnibus combining function is \[ f_F(\boldsymbol{p})=-2\sum_{j=1}^d\log(p_{j}), \] and the Liptak combining function is \[ f_L(\boldsymbol{p}) = \sum_{j=1}^d \Phi^{-1}(1-p_{j}). \] For a combining function \(f\), we have \(Q_0 = f(\mathbf{p})\) for the original data (stored in f), and \(Q_b=f(\mathbf{p}_b)\) for \(b=1,\dots,B\) permutations (stored in a vector f.p). The multivariate permutation p-value is obtained, see also (C.3) of Section 2.1 of the paper, as \[ p_0= (B+1)^{-1} \Bigl\{1+\sum_{b=1}^B \boldsymbol{1}(Q_{b}\geq Q_{0})\Bigr\}. \]

## tippett combining function
f=pmax(1-pval1,1-pval2)
f.p=pmax(1-pval1.p,1-pval2.p)
pval.tp=(1+sum(f.p>=f))/(1+B)

## fisher combining function
f=-2*(log(pval1)+log(pval2))
f.p=-2*(log(pval1.p)+log(pval2.p))
pval.fp=(1+sum(f.p>=f))/(1+B)

## liptak combining function
f=qnorm(1-pval1)+qnorm(1-pval2)
f.p=qnorm(1-pval1.p)+qnorm(1-pval2.p)
pval.lp=(1+sum(f.p>=f))/(1+B)

print(paste("Tippett (p-value):",pval.tp))
## [1] "Tippett (p-value): 0.071"
print(paste("Fisher (p-value):",pval.fp))
## [1] "Fisher (p-value): 0.1125"
print(paste("Liptak (p-value):",pval.lp))
## [1] "Liptak (p-value): 0.1375"

Discrete optimal transport with a product grid

For the permutation test based on the optimal transport, we set \(B=999\). The following code shows construction of the product type grid with \(1000\) points, \(n_S=50\) and \(n_R=20\), \(n_0=0\). Since both test statistics Fmax and SQ are one-sided, we use only a grid on positive values, as described in Section 4.2 of the paper. The grid points are computed as \[ \mathbf{g}_{ij}= \frac{i}{n_R+1}\left(\cos\bigl(\frac{\pi}{2} \varphi_j\bigr),\sin \bigl(\frac{\pi}{2} \varphi_j\bigr) \right)^\top, \quad \varphi_j = \frac{j-1}{n_S-1}, \] \(i=1,\dots,n_R\), \(j=1,\dots,n_S\).

n0=0
nR=20
nS=50
bb=nS*nR+n0-1   # B=999

xs1=rep(1:nR,each=nS)/(nR+1)
xs2=rep((0:(nS-1))/(nS-1),nR)

## grid with positive values (for one-sided statistics or p-values)
s1pos=c(xs1*cos(pi*xs2/2),rep(0,n0))
s2pos=c(xs1*sin(pi*xs2/2),rep(0,n0))

target=cbind(s1pos,s2pos)

This grid will be utilized for the optimal transport of the test statistics in part A, in a sense of Section 4.1 of the paper. Part B. considers the optimal transport of the partial p-values, as proposed in Section 4.2 of the paper.

A. Optimal transport of the test statistics

The discrete optimal transport is computed using function solve_LSAP from package clue, similarly as in the first simulation example. The data points to be transported are stored in \(1000\times2\) matrix stat2d. The first row of the matrix corresponds to the statistics SQ and Fmax, respectively, computed from the original data (vector \(\mathbf{T}_0\)), the remaining rows are their permutation replicas (vectors \(\mathbf{T}_B = (T_{1,b},T_{2,b})^\top\) for \(b=1,\dots,B\)).

## distances (for optimal transport)
sqdist<-function(a,b){
    return(abs(a-b)^2)
}

## data to be transported: 
stat2d=cbind(c(covsq,covsq.b[1:bb]),c(fanfd,fanfd.b[1:bb])) 

## L2 distances
costes<-outer(stat2d[,1],target[,1],"sqdist")+outer(stat2d[,2],target[,2],"sqdist")

## calculate the optimal transport of statistics ("stat2d") to "target"
library(clue)
asignacion<-solve_LSAP(costes)

## transported test statistic
target.stat=target[asignacion[1],]

The situation is visualized on the plots below. The data point corresponding to the original data is stressed by red color in both plots.

par(mfrow=c(1,2))

plot(stat2d[-1,],pch=21,col="blue",bg="lightblue",type="p",xlab="SQ",ylab="Fmax")
points(stat2d[1],stat2d[2],pch=19,col="red")

FB=target[asignacion,]

plot(FB[-1,],pch=21,col="blue",bg="lightblue",type="p",cex=0.3,asp=1,xlab=expression(g[1]),ylab=expression(g[2]))
points(FB[1,1],FB[1,2],pch=19,col="red")

The resulting permutation p-value \(\widehat{p}_1\) is:

pval.tw=1-sqrt(sum(target.stat^2)) 
##
trgt.tw=target.stat

print(paste("p-value (T(P+), product grid):",format.pval(pval.tw)))
## [1] "p-value (T(P+), product grid): 0.047619"

Furthermore, we can compute the contributions of partial tests:

D=trgt.tw/sqrt(sum(trgt.tw^2))

print("contributions of partial test statistics:")
## [1] "contributions of partial test statistics:"
print(D^2)
##      s1pos      s2pos 
## 0.91904405 0.08095595

This shows that the contribution of SQ to the rejection of \(H_0\) is 92 %. The test statistic Fmax contributes only by 8 %.

B. Optimal transport of partial p-values

We also compute the discrete optimal transport of the vector \(\mathbf{q}_b=(1-p_{1,b},1-p_{2,b})\), as described in Section 4.2 of the paper.

pdata=cbind(c(1-pval1,1-pval1.p[1:bb]),c(1-pval2,1-pval2.p[1:bb])) 
costes<-outer(pdata[,1],target[,1],"sqdist")+outer(pdata[,2],target[,2],"sqdist")
asignacion<-solve_LSAP(costes)
target.stat=target[asignacion[1],]  

The situation is visualized on the following two plots. The point corresponding to the original data is highlighted by the red color.

par(mfrow=c(1,2))

plot(pdata[-1,],pch=21,col="blue",bg="lightblue",type="p",xlab="1-p-value SQ",ylab="1-p-value Fmax")
points(pdata[1],pdata[2],pch=19,col="red")

FB=target[asignacion,]

plot(FB[-1,],pch=21,col="blue",bg="lightblue",type="p",cex=0.3,asp=1,xlab=expression(g[1]),ylab=expression(g[2]))
points(FB[1,1],FB[1,2],pch=19,col="red")

The resulting p-value \(\widetilde{p}_a\) is:

pval.pw=1-sqrt(sum(target.stat^2))  
trgt.pw=target.stat
print(paste("p-value (p(P+), product grid):",format.pval(pval.pw)))
## [1] "p-value (p(P+), product grid): 0.095238"

The contributions of partial tests are:

print("contributions of partial test statistics:")
## [1] "contributions of partial test statistics:"
print(perc.pw<-(trgt.pw^2)/(sum(trgt.pw^2)))
##      s1pos      s2pos 
## 0.96345838 0.03654162

This approach suggests that the contribution of SQ to the rejection of \(H_0\) is 96. The test statistic Fmax contributes only by 4 %.

Discrete optimal transport with a non-product grid

Recall that a non-product grid of \(B+1\) points in \(\mathbb{R}^d\) requires a low-discrepancy sequence of \(B+1\) points in \([0,1]^d\). These can be obtained as a Halton sequence for any \(d\) and any \(B\), or as a GLP sequence, which gives good results only for certain choices of \(B\), see Chapter 1 of (Fang and Wang 1994). We illustrate here the later approach. The number of points \(B+1=987\) is chosen from Table A.1 in (Fang and Wang 1994). As before, only grid on \([0,1]^2\) is considered.

bb=986

## generating vector
h1=1
h2=610
fract=function(x){x-floor(x)}  

nn=bb+1 # B+1 points
k=1:nn
xk1=fract((2*k*h1-1)/(2*nn))
xk2=fract((2*k*h2-1)/(2*nn))
 
## spherically univorm grid only on the positive quadrant
g1pos=xk1*cos(pi*xk2/2)
g2pos= xk1*sin(pi*xk2/2)

target=cbind(g1pos,g2pos)

A. Optimal transport of the test statistics

The computation of the transport is analogous as for the product grid above:

stat2d=cbind(c(covsq,covsq.b[1:bb]),c(fanfd,fanfd.b[1:bb])) 
costes<-outer(stat2d[,1],target[,1],"sqdist")+outer(stat2d[,2],target[,2],"sqdist")
asignacion<-solve_LSAP(costes)
target.stat=target[asignacion[1],] 

The final p-value \(\widetilde{p}_a\) and the partial contributions are:

pval.tg=1-sqrt(sum(target.stat^2))  
trgt.tg=target.stat

print(paste("p-value (t(N+), non-product grid):",pval.tg))
## [1] "p-value (t(N+), non-product grid): 0.0805471124620062"
print("contributions of partial test statistics:")
## [1] "contributions of partial test statistics:"
print(perc.tg<-(trgt.tg^2)/(sum(trgt.tg^2)))
##      g1pos      g2pos 
## 0.92650707 0.07349293
par(mfrow=c(1,2))

plot(stat2d[-1,],pch=21,col="blue",bg="lightblue",type="p",xlab="SQ",ylab="Fmax")
points(stat2d[1],stat2d[2],pch=19,col="red")

FB=target[asignacion,]

plot(FB[-1,],pch=21,col="blue",bg="lightblue",type="p",cex=0.3,asp=1,xlab=expression(g[1]),ylab=expression(g[2]))
points(FB[1,1],FB[1,2],pch=19,col="red")

B. Optimal transport of partial p-values

The optimal transport of the partial p-values to the non-product grid proceeds analogously as well:

pdata=cbind(c(1-pval1,1-pval1.p[1:bb]),c(1-pval2,1-pval2.p[1:bb])) 
target=cbind(g1pos,g2pos)
costes<-outer(pdata[,1],target[,1],"sqdist")+outer(pdata[,2],target[,2],"sqdist")
asignacion<-solve_LSAP(costes)
target.stat=target[asignacion[1],]

The final p-value \(\widetilde{p}_a\) and the partial contributions are:

pval.pg=1-sqrt(sum(target.stat^2)) 
trgt.pg=target.stat
print(paste("p-value (p(N+), non-product grid):",pval.pg))
## [1] "p-value (p(N+), non-product grid): 0.0592705167173252"
print("contributions of partial test statistics:")
## [1] "contributions of partial test statistics:"
print(perc.pg<-(trgt.pg^2)/(sum(trgt.pg^2)))
##      g1pos      g2pos 
## 0.94298376 0.05701624
par(mfrow=c(1,2))

plot(pdata[-1,],pch=21,col="blue",bg="lightblue",type="p",xlab="1- p-value SQ",ylab="1-p-value Fmax")
points(pdata[1],pdata[2],pch=19,col="red")

FB=target[asignacion,]

plot(FB[-1,],pch=21,col="blue",bg="lightblue",type="p",cex=0.3,asp=1,xlab=expression(g[1]),ylab=expression(g[2]))
points(FB[1,1],FB[1,2],pch=19,col="red")

References

Cabassi, Alessandra, Davide Pigoli, Piercesare Secchi, and Patrick A Carter. 2017. “Permutation Tests for the Equality of Covariance Operators of Functional Data with Applications to Evolutionary Biology.” Ejs 11 (2): 3815–40.
Fang, K. T., and Y. Wang. 1994. Number-Theoretic Methods in Statistics. Chapman & Hall.
Górecki, Tomasz, and Łukasz Smaga. 2019. fdANOVA: An r Software Package for Analysis of Variance for Univariate and Multivariate Functional Data.” Cs 34 (2): 571–97.
Hlávka, Zdeněk, Daniel Hlubinka, and Kateřina Koňasová. 2022. “Functional ANOVA Based on Empirical Characteristic Functionals.” Jma 189: 104878.