Gene expression profile analysis and statistical learning
Gene expression analysis is essentially a pattern recognition problem, so we apply various theories and methods of statistical learning to gene expression profile data. From the viewpoint of statistical learning, however, such data pose several characteristic difficulties.
High dimensionality
The number of genes under analysis is large. Even after restricting the set, we must handle tens of genes, and sometimes as many as thousands. By contrast, the number of cell samples is only in the tens or hundreds.
To investigate the characteristics of genes, we express each gene as a vector whose dimensionality is the number of cell samples. To investigate samples, we express each sample as a vector whose dimensionality is the number of genes. In both cases, we face the difficulties of high dimensionality.
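The two views described above are simply the rows and the columns of one expression matrix. The following is a minimal sketch using NumPy; the matrix size and the random values are hypothetical toy data, not measurements from our studies.

```python
import numpy as np

# Hypothetical toy data: 6 genes measured in 3 cell samples.
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 3))  # rows = genes, columns = samples

# Gene view: each row is one gene; dimensionality = number of samples.
gene_vectors = expression

# Sample view: each row is one sample; dimensionality = number of genes.
sample_vectors = expression.T

print(gene_vectors.shape, sample_vectors.shape)
```

With thousands of genes and only tens of samples, the sample view has far more dimensions than data points, while the gene view has many points in a space of tens to hundreds of dimensions.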
Complexity and non-linearity
We must assume complex models behind gene expression data, because the underlying network involves interactions among thousands of genes. Noise and artifacts are not small and may not be simple.
Requirement for high reliability
When we design a clinical application, the analysis results must be highly reliable. New approaches are needed, since few existing methods can guarantee statistical reliability in such newly developed problem settings.
To tackle these difficult problems, we apply many of our laboratory's earlier results in statistical learning research. We also develop new, specialized analysis methods and tools based on our general research on machine learning.
So far, we have obtained the following results.
We proposed the mixture of constrained principal component analyses (MCPCA) model, which constrains the mixture of PCA model (see the statistical learning page) so that each principal axis passes through the origin of the high-dimensional space. Using MCPCA, we extracted reasonable gene clusters that have characteristic time courses in Bacillus coly development data. Estimation of the number of clusters using the variational Bayes method became more accurate than with our former model.
We also validated the clustering results using other indices derived from a biological perspective.
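To illustrate the origin constraint, the sketch below clusters profiles around axes that are forced to be lines through the origin. It is a simplified stand-in and not the MCPCA model itself: it uses hard assignments instead of a probabilistic mixture, one axis per cluster, and no variational Bayes estimation of the cluster number. The function name, the init parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def cluster_on_origin_axes(X, k, init=None, n_iter=50, seed=0):
    """Hard-assign rows of X to k axes that all pass through the origin.

    Simplified stand-in, NOT the MCPCA model: no probabilistic mixture,
    no variational Bayes, one axis per cluster.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumption: start from k randomly chosen (nonzero) data rows.
        init = X[rng.choice(len(X), size=k, replace=False)]
    axes = np.asarray(init, dtype=float)
    axes = axes / np.linalg.norm(axes, axis=1, keepdims=True)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared distance to a line through the origin is
        # ||x||^2 - (x . a)^2, so the nearest axis maximises |x . a|.
        new_labels = np.argmax(np.abs(X @ axes.T), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous axis for an empty cluster
            # No mean-centering: the leading right singular vector of the
            # raw data is the best-fitting direction through the origin.
            _, _, vt = np.linalg.svd(members, full_matrices=False)
            axes[j] = vt[0]
    return axes, labels

# Hypothetical toy profiles: two groups scaled along two different axes.
rng = np.random.default_rng(1)
t = np.linspace(1.0, 3.0, 20)
e1, e2 = np.eye(5)[0], np.eye(5)[1]
X = np.vstack([np.outer(t, e1), np.outer(-t, e2)])
X = X + 0.01 * rng.standard_normal(X.shape)
axes, labels = cluster_on_origin_axes(
    X, k=2, init=[[1, 0.2, 0, 0, 0], [0.2, 1, 0, 0, 0]]
)
```

Because the data are not mean-centered, profiles that share a time-course shape but differ in amplitude or sign fall on the same axis, which is one way to read the constraint described above.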
written by Shigeyuki Oba