Gene expression profile analysis and statistical learning

Gene expression analysis is essentially a pattern recognition problem, so we apply theories and methods from statistical learning to gene expression profile data. From the viewpoint of statistical learning, however, such data pose several characteristic difficulties.

High-dimensionality

The number of genes under analysis is large. Even after restricting the set, we must handle tens of genes, and sometimes as many as thousands. The number of cell samples, on the other hand, is only in the tens or hundreds.

If we study the characteristics of genes, each gene is expressed as a vector whose dimensionality is the number of cell samples. If we study the samples, each sample is expressed as a vector whose dimensionality is the number of genes. In both cases, we face the difficulties of high dimensionality.
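The two views are simply the rows and the columns of the same expression matrix. The following minimal Python sketch illustrates this with a made-up toy matrix; the values and the function names `gene_vectors` and `sample_vectors` are illustrative assumptions and are not part of our tools.

```python
# Toy expression matrix (hypothetical values): rows are genes, columns are cell samples.
# Real data would have tens to thousands of rows and tens to hundreds of columns.
expression = [
    [0.5, 1.2, 0.9],  # gene 1
    [2.0, 0.1, 0.4],  # gene 2
    [1.1, 1.3, 1.0],  # gene 3
    [0.2, 0.8, 1.7],  # gene 4
]


def gene_vectors(matrix):
    """One vector per gene; its dimensionality is the number of samples."""
    return [list(row) for row in matrix]


def sample_vectors(matrix):
    """One vector per sample; its dimensionality is the number of genes."""
    return [list(column) for column in zip(*matrix)]


genes = gene_vectors(expression)
samples = sample_vectors(expression)

# Number of vectors and dimensionality of each vector, for both views.
print(len(genes), len(genes[0]))
print(len(samples), len(samples[0]))
```

Whichever view is chosen, one of the two numbers is the dimensionality and the other is the number of data points, so a data set with many genes and few samples gives either many low-dimensional points or few very high-dimensional ones.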

Complexity and non-linearity

We must assume complex models behind gene expression data, because the underlying network involves interactions among thousands of genes. Noise and artifacts are not small, and they may not have a simple structure.

High reliability requirements

When we design a clinical application, the analysis results must be highly reliable. New approaches are needed, because few existing methods can guarantee statistical reliability in such newly developed problem settings.

To tackle these difficult problems, we apply many of our laboratory's earlier results in statistical learning research. We also develop new specialized analysis methods and tools based on our general research on machine learning.

So far, we have obtained the following results.

Parametric clustering

We proposed the mixture of constrained principal component analyses (MCPCA) model, which constrains the mixture of PCA model (see the statistical learning page) so that each principal axis points toward the origin of the high-dimensional space. Using MCPCA, we extracted reasonable gene clusters that have characteristic time courses from Bacillus subtilis development data. Estimation of the number of clusters by the variational Bayes method became more accurate than with our former model.

We also validated the clustering results using other indices derived from a biological perspective.
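As a rough illustration of the constraint, the following Python sketch assumes that it amounts to treating each cluster's principal axis as a line through the origin, so that a gene is judged by the direction of its expression vector. It only computes such an axis for a single group of points and the distance of a point to that axis; it is not the MCPCA model itself and contains no mixture or variational Bayes estimation. The function names and the toy data are hypothetical.

```python
import math


def leading_axis_through_origin(points, n_iter=100):
    """Unit vector w maximizing sum((x . w)^2): a first principal axis without mean-centering."""
    dim = len(points[0])
    w = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit starting direction (assumption of this sketch)
    for _ in range(n_iter):
        # Power iteration: multiply w by the uncentered second-moment matrix sum(x x^T)
        # without forming the matrix explicitly.
        new_w = [0.0] * dim
        for x in points:
            projection = sum(xi * wi for xi, wi in zip(x, w))
            for d in range(dim):
                new_w[d] += projection * x[d]
        norm = math.sqrt(sum(value * value for value in new_w))
        if norm == 0.0:
            break  # starting direction is orthogonal to every point; keep the current w
        w = [value / norm for value in new_w]
    return w


def distance_to_axis(x, axis):
    """Distance from point x to the line through the origin along the unit vector axis."""
    projection = sum(xi * ai for xi, ai in zip(x, axis))
    residual = [xi - projection * ai for xi, ai in zip(x, axis)]
    return math.sqrt(sum(r * r for r in residual))


# Toy time courses with the same shape but different amplitudes and signs.
cluster = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-0.5, -1.0, -1.5]]
axis = leading_axis_through_origin(cluster)
print(axis, distance_to_axis([2.0, 4.0, 6.0], axis))
```

Under this reading, time courses that differ only in amplitude lie on the same axis, so their distance to it is close to zero.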

  • Yoshioka, T., Morioka, R., Kobayashi, K., Oba, S., Ogasawara, N., & Ishii, S. Clustering of gene expression data by mixture of PCA models. International Conference on Artificial Neural Networks (ICANN 2002), Lecture Notes in Computer Science, 2415, Springer-Verlag, pp.522-527 (2002).

Bayesian missing value estimator [BPCAfill]

Gene expression profiles often contain missing values caused by various measurement problems. These must be filled in with appropriate values before subsequent analysis.

We developed a missing value estimator based on a variational approximation of Bayesian principal component analysis. It achieves very good accuracy, a substantial improvement over former methods. It also has some useful properties: it needs no hand-tuned parameters, and its accuracy reliably improves as the number of genes and/or the number of samples increases.

This method, named "BPCAfill", is publicly available as application software written in MATLAB and Java. See also the [software page].
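To illustrate the general idea of filling missing entries from a low-dimensional model fitted to the observed ones, here is a deliberately simplified Python sketch. It is not BPCAfill: it replaces Bayesian PCA with a plain rank-one least-squares factorization, has no priors and no variational approximation, and uses a fixed number of iterations. The function name `rank_one_impute`, the use of `None` for missing values, and the toy matrix are all assumptions made for this example.

```python
def rank_one_impute(matrix, n_iter=100):
    """Fill None entries of a genes x samples matrix using a rank-one fit u[i] * v[j].

    The fit alternates least-squares updates of u and v over the observed entries only.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        # Update each row factor from the observed entries of that row.
        for i in range(n_rows):
            observed = [j for j in range(n_cols) if matrix[i][j] is not None]
            denominator = sum(v[j] ** 2 for j in observed)
            if denominator > 0.0:
                u[i] = sum(matrix[i][j] * v[j] for j in observed) / denominator
        # Update each column factor from the observed entries of that column.
        for j in range(n_cols):
            observed = [i for i in range(n_rows) if matrix[i][j] is not None]
            denominator = sum(u[i] ** 2 for i in observed)
            if denominator > 0.0:
                v[j] = sum(matrix[i][j] * u[i] for i in observed) / denominator
    # Keep observed values as they are; replace only the missing ones.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy profile (hypothetical values) with two missing measurements.
profile = [
    [2.0, None, 6.0, 8.0],
    [4.0, 8.0, 12.0, 16.0],
    [6.0, 12.0, 18.0, None],
]
filled = rank_one_impute(profile)
print(filled[0][1], filled[2][3])
```

The toy matrix is exactly rank one, so the fit can recover the two removed values (4.0 and 24.0) up to numerical precision. Real expression data are noisy and need more than one component, which is where a Bayesian treatment that avoids hand-tuned parameters becomes useful.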

  • Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., & Ishii, S. Missing value estimation using mixture of PCAs. International Conference on Artificial Neural Networks (ICANN 2002), Lecture Notes in Computer Science, 2415, Berlin: Springer-Verlag, pp.492-497 (2002).
  • Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., & Ishii, S. A Bayesian missing value estimation method. Bioinformatics, 19, pp.2088-2096 (2003). URL: http://bioinformatics.oupjournals.org/cgi/content/abstract/19/16/2088?etoc

written by Shigeyuki Oba