BPCAfill is a missing value estimation software suitable for gene expression profile data. (ex. measured by DNA microarrays)
BPCAfill estimates missing values in the gene expression matrix, and fill them by the estimated values. The estimation almost always exhibits the best accuracy of all existing methods (as long as I know in 2003). Especially, when your data set has large number of genes and/or large number of samples, BPCAfill will efficiently use all the provided informations and output the best result, even when there are large number of missing entries.
In addition, it is very easy to use, because it do not need any type of hand tuning parameters.
See also the paper,
26 Nov. 2002 :
first release
18 Nov. 2004 :
slightly updated
Zipped archive of MATLAB source files: BPCAFILL.zip(4kB)
Contents in the BPCAFILL.zip BPCAfill.m , BPCA_dostep.m , BPCA_initmodel.m , BPCA_reestimation.m , BPCA_plotM.m , missingNRMSE.m
Zipped archive of small sample data files: SAMPLE.zip(407kB)
Contents in the SAMPLE.zip sample999.dat , sampleorg.dat
Place the source files (*.m) in arbitrary directory in MATLAB PATH.
Prepare data matrix in which missing entries in matrix are
denoted by value 999.0.
BPCAfill estimates the missing entries using BPCA model
and returns complete matrix filled with the estimated values.
>> x999 = load('sample999.dat'); >> xfilled = BPCAfill( x999 ); epoch=10, dtau=1.25922 epoch=20, dtau=0.0970274 epoch=30, dtau=0.0349725 epoch=40, dtau=0.0150597 epoch=50, dtau=0.00761689 epoch=60, dtau=0.0034555 epoch=70, dtau=0.000254066 epoch=80, dtau=4.42614e-005 <-- when dtau < 1e-004, it is decided as converged. >> save FilledExpression.dat -ascii xfilled
Please wait for the calculation to converge. It will take some minutes or hours till finish. (Example. The sample999.dat (757x50 entries with 797 missing) costs 10 minutes to fill using Pentium III 600MHz processer. )
After the BPCAfill process, you can obtain a rough measure of the estimation acculacy prediction by using re-estimation process. BPCA_reestimation simulates 1% of artificial missing entries in the filled matrix, and estimates them again. The error on the re-estimation process will be a good prediction on the original error.
If the estimated NRMSE (normalized root mean squared error) value is near 1.00, the original fill might also not be reliable because of noise in the original expression data and/or other reasons.
>> [xfilled,M] = BPCAfill(sample999); ... >> estimatedNRMSE = BPCA_reestimation(M); rate = 0.0100 epoch=10, dtau=1.2374 ... epoch=80, dtau=5.04714e-005 >> estimatedNRMSE estimatedNRMSE = 0.3664 >> xorg = load('sampleorg.dat'); >> trueNRMSE = missingNRMSE(x999, xorg, xfilled) trueNRMSE = 0.4375
In the abobe, M and M2 are MATLAB structure variables which consist whole result of BPCA model estimation.
BPCA is a Bayesian variation of PCA, and it estimates covariance of the expression matrix as some principal axes. A profile of the result will be illustrated by BPCA_plotM.
>> [xfilled, M] = BPCAfill( x999 ); .... >> BPCA_plotM( M );