
Research Article

Open Access
L1 Least Square for Cancer Diagnosis using Gene Expression Data
Xiyi Hang1 and Fang-Xiang Wu2,3,*
1Department of Electrical and Computer Engineering, California State University, Northridge, CA 91330, USA
2Department of Mechanical Engineering
3Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 5A9, Canada
*Corresponding author: Dr. Fang-Xiang Wu,
Division of Biomedical Engineering, University of Saskatchewan,
Saskatoon, Saskatchewan, S7N 5A9, Canada,
E-mail: xhang@csun.edu, faw341@mail.usask.ca
Received March 19, 2009; Accepted April 27, 2009; Published April 27, 2009
Citation: Hang X, Wu FX (2009) L1 Least Square for Cancer Diagnosis using Gene Expression Data. J Comput Sci Syst Biol 2: 167-173. doi:10.4172/jcsb.1000028
Copyright: © 2009 Hang X, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract

The performance of most methods for cancer diagnosis using gene expression data depends greatly on careful model selection. Least square for classification has no need of model selection. However, a major drawback prevents its successful application to microarray data classification: lack of robustness to outliers. In this paper we cast linear regression as a constrained l1-norm minimization problem to greatly alleviate its sensitivity to outliers, and hence the name l1 least square. The numerical experiment shows that l1 least square can match the best performance achieved by support vector machines (SVMs) with careful model selection.

Keywords: l1-norm minimization; least square regression; classification; cancer; gene expression data; support vector machine

Introduction

The DNA microarray technique has the potential to provide a more accurate and objective cancer diagnosis than the traditional histopathological approach, given its high-throughput capability of simultaneously measuring the relative expression levels of tens of thousands of genes. The success, however, greatly depends upon the supervised learning algorithm selected to classify the gene expression data.

Many well-established methods are available for gene expression profile classification. According to Lee et al. (2005), they fall into four categories: (1) classical methods, such as Fisher's linear discriminant analysis, logistic regression, K-nearest neighbor, and generalized partial least square; (2) classification trees and aggregation methods, such as CART, random forest, bagging, and boosting; (3) machine learning methods, such as neural networks and support vector machines (SVMs); and (4) generalized methods, such as flexible discriminant analysis, mixture discriminant analysis, and the shrunken centroid method. The performance of many methods, however, relies on careful choice of model parameters, which can be done via a model selection procedure such as cross validation. For example, the model parameters for SVMs include the kernel parameters and the penalty parameter C. A recent controversy over the performance comparison between SVM and random forest exemplifies the importance of model selection. The study by Diaz-Uriarte et al. (2006) concludes that random forest outperforms SVM, while the conclusion of Statnikov et al. (2008) is the opposite. The main difference between the two studies is that model selection is carefully designed in the latter but not in the former. The incident also shows that model selection may be an obstacle to the extensive application of SVM to the classification of gene expression profiles. Since classification performance is a nonconvex function of the model parameters, it is usually difficult to find optimal model parameters by model selection.

Least square for classification, on the other hand, has no need of model selection. Consider a general classification problem with N classes. A linear model is built for each class k
 y_k = w_k^T x + w_{k0},  k = 1, 2, ..., N.                          (1)

The N equations can be grouped into
  y = W x̃                                           (2)

where y = [y_1, y_2, ..., y_N]^T, W is a matrix whose kth row is [w_k^T, w_{k0}], and x̃ = [x^T, 1]^T. Consider a training dataset {(x_i, t_i), i = 1, 2, ..., n}, where t_i is the 1-of-N binary coding vector of the label of the ith feature vector x_i, i.e., a vector containing zeros everywhere except a 1 in the kth position if x_i belongs to category k. Denote by X the feature matrix whose ith row is [x_i^T, 1], and by T the target matrix whose ith row is t_i^T. The linear regression model in (2) can be fitted simultaneously to each of the columns of T, and the solution has the form
Ŵ = (X^T X)^{-1} X^T T                            (3)

For any sample x, classification then proceeds in two steps:
• Calculate the fitted output ŷ = Ŵ^T [x^T, 1]^T (an N-dimensional vector);
• Label = arg max_k ŷ(k), k = 1, 2, ..., N.
More details can be found in the literature (Bishop, 2006; Hastie et al., 2001).
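The two-step procedure above can be sketched compactly. Below is a minimal Python illustration (the paper's implementation was in MATLAB; the function names here are ours, not the authors'):

```python
import numpy as np

def fit_least_square(X_feat, labels, N):
    """Fit the linear model of Eqs. (1)-(3): one column of W per class."""
    n = X_feat.shape[0]
    X = np.hstack([X_feat, np.ones((n, 1))])   # append 1 for the bias term
    T = np.zeros((n, N))
    T[np.arange(n), labels] = 1.0              # 1-of-N target coding
    # lstsq computes the least-square solution (X^T X)^{-1} X^T T stably
    W, *_ = np.linalg.lstsq(X, T, rcond=None)
    return W                                   # shape (d+1, N)

def predict(W, x):
    y_hat = np.append(x, 1.0) @ W              # fitted N-dimensional output
    return int(np.argmax(y_hat))               # label = arg max_k y_hat(k)
```

On well-separated data the arg max of the fitted output recovers the class; the sensitivity to outliers discussed next arises because the squared-error fit penalizes points far from the decision boundary.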

The above approach, however, is very sensitive to outliers, especially in multicategory classification (N ≥ 3). Furthermore, when least square for classification is applied to gene expression data, the problems become more severe due to the curse of dimensionality caused by the large number of genes in each sample.
Inspired by recent progress in sparse signal recovery via l1-norm minimization (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006), we propose a new approach that overcomes the major drawback of least square for classification by casting the linear regression problem as a constrained l1-norm minimization problem. The resulting sparse solution is much less sensitive to both outliers and the curse of dimensionality. In addition, multicategory classification is realized via the one-versus-rest (OVR) and one-versus-one (OVO) approaches, which decompose the original multicategory problem into a series of binary problems. The new method is validated by comparing its cancer diagnosis performance with that of SVMs.


Binary l1 Least Square

Consider a training dataset {(x_i, y_i); i = 1, ..., n}, x_i ∈ R^d, y_i ∈ {-1, +1}, where x_i represents the ith sample, a d-dimensional column vector containing gene expression values with d the number of genes, and y_i is the label of the ith sample. The two classes are described by a linear model
 y = [xT,1]w                                      (4)

for any sample x. Applying the linear model to the training dataset, we have
 y_i = [x_i^T, 1] w,  i = 1, 2, ..., n                 (5)

The n equations can be grouped into
   y = Xw                                          (6)

where y = [y_1, y_2, ..., y_n]^T, and X is an n × (d + 1) matrix whose ith row is [x_i^T, 1]. Since the number of samples is much smaller than the number of genes, i.e., n << d, the system in (6) is underdetermined. The solution is obtained by casting the original problem as the following constrained l1-norm minimization problem
min ||w||1 subject to Xw = y          (7)

The above formulation is inspired by recent progress in compressed sensing (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006) and basis pursuit denoising (Chen et al., 2001).

Quite a few solvers are available for the optimization problem defined in (7), such as MOSEK (Andersen, 2002), PDCO-CHOL (Saunders, 2002), PDCO-LSQR (Saunders, 2002), and l1-magic (Candès and Romberg, 2006), all of which are interior-point methods. In this study we choose the solver SPGL1 (Friedlander and Van den Berg, 2008) for its efficiency on large-scale problems. Unlike the other methods, SPGL1 solves the optimization problem by converting it into a root-finding problem. Please refer to Van den Berg and Friedlander (2008) for details on the theory behind SPGL1.

Denote by ŵ the solution to (7). Then for any sample x, the label can be simply assigned as sign([xT ,1]ŵ).
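Problem (7) is a linear program after the standard split |w_k| ≤ u_k, so it can be solved with any generic LP solver. The sketch below uses plain SciPy rather than SPGL1 (names and toy data are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def l1_least_square(X, y):
    """Solve min ||w||_1 subject to X w = y as a linear program.

    Variables are (w, u) with |w_k| <= u_k; minimize sum(u)."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])
    I = np.eye(p)
    A_ub = np.block([[I, -I], [-I, -I]])          # w - u <= 0 and -w - u <= 0
    A_eq = np.hstack([X, np.zeros((n, p))])       # equality constraint X w = y
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p),
                  A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * p + [(0, None)] * p)
    return res.x[:p]

# Toy underdetermined problem: 4 samples, 10 "genes" plus a bias column.
rng = np.random.default_rng(0)
X = np.hstack([rng.standard_normal((4, 10)), np.ones((4, 1))])
y = np.array([1.0, -1.0, 1.0, -1.0])
w = l1_least_square(X, y)
labels = np.sign(X @ w)                           # sign([x^T, 1] ŵ) per sample
```

Because the system is underdetermined, the constraint X w = y is met exactly, and the l1 objective picks out a sparse interpolant among the infinitely many solutions.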

Multicategory l1 Least Square: OVR
Consider a multicategory training dataset {(x_i, y_i); i = 1, ..., n}, x_i ∈ R^d, y_i ∈ {1, 2, ..., N}, where N is the number of categories. The OVR approach determines, for each class, a binary classifier that separates it from the remaining classes. The N linear models are defined as
 D_k(x) = [x^T, 1] w_k,  k = 1, 2, ..., N         (8)

For category k, after changing the labels of the samples belonging to class k to +1 and all others to -1, we apply the linear model to the training dataset
yk = Xwk , k = 1, 2,....,N,                    (9)

where yk is a label vector containing either +1 or -1. Similarly, the above N underdetermined systems can be solved by the following N constrained l1-norm minimization problems
min ||w_k||_1 subject to Xw_k = y_k          (10)
where k = 1,2,....N.

Denote by ŵ_k the solution to (10). Then for any sample x, the label can be determined by

arg max_{k=1,2,...,N} D_k(x) = [x^T, 1] ŵ_k.        (11)
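The OVR construction of Eqs. (8)-(11) can be sketched end to end by reusing the LP reformulation of (7). This is a self-contained Python sketch with illustrative names and toy data, not the authors' MATLAB code:

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(X, y):
    """min ||w||_1 subject to X w = y, via the standard LP reformulation."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])
    I = np.eye(p)
    A_ub = np.block([[I, -I], [-I, -I]])           # enforces |w_k| <= u_k
    A_eq = np.hstack([X, np.zeros((n, p))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * p), A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * p + [(0, None)] * p)
    return res.x[:p]

def ovr_fit(X_feat, labels, N):
    """One +1/-1 problem per class, as in Eqs. (9)-(10)."""
    X = np.hstack([X_feat, np.ones((X_feat.shape[0], 1))])
    return [l1_solve(X, np.where(labels == k, 1.0, -1.0)) for k in range(N)]

def ovr_predict(ws, x):
    xt = np.append(x, 1.0)
    return int(np.argmax([xt @ w for w in ws]))    # Eq. (11)

# Toy separable data: 6 samples, 10 genes, 3 classes (n < d: underdetermined).
rng = np.random.default_rng(1)
labels = np.array([0, 0, 1, 1, 2, 2])
X_feat = 0.1 * rng.standard_normal((6, 10))
X_feat[np.arange(6), labels] += 3.0               # shift one gene per class
ws = ovr_fit(X_feat, labels, 3)
preds = [ovr_predict(ws, X_feat[i]) for i in range(6)]
```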

Multicategory l1 Least Square: OVO
In OVO approach, a binary classifier is constructed for each pair of classes. The linear model for class i against class j is given by

D_{i,j}(x) = [x^T, 1] w_{i,j}                                       (12)

For the samples of categories i and j, we change their labels to +1 and -1, respectively; applying the linear model then gives
 y_{i,j} = X_{i,j} w_{i,j}                                              (13)

where y_{i,j} is a vector containing either +1 or -1, and X_{i,j} is a matrix whose kth row is [x_k^T, 1] with x_k belonging to either category i or j. The underdetermined system is solved by
 min ||w_{i,j}||_1 subject to X_{i,j} w_{i,j} = y_{i,j}         (14)

Since D_{j,i} = -D_{i,j}, the number of classifiers is the binomial coefficient (N choose 2), i.e., N(N-1)/2, compared with N in the OVR approach.

Denote by ŵ_{i,j} the solution to (14). For any sample x, we calculate
 D_i(x) = Σ_{j=1, j≠i}^N sign(D_{i,j}(x))                          (15)

with D_{i,j}(x) = [x^T, 1] ŵ_{i,j}. The label of x is determined by

arg max_{i=1,2,...,N} D_i(x)                                          (16)
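The OVO voting rule of Eqs. (15)-(16) is simple once the pairwise decision values are in hand. A minimal sketch (the decision matrix below is illustrative, not real data):

```python
import numpy as np

def ovo_vote(D):
    """OVO voting rule of Eqs. (15)-(16). D is an N x N antisymmetric matrix
    whose (i, j) entry is the pairwise decision D_{i,j}(x) = [x^T, 1] w_{i,j}."""
    votes = np.sign(D).sum(axis=1)   # D_i(x) = sum over j != i of sign(D_{i,j})
    return int(np.argmax(votes))     # label = arg max_i D_i(x)

# Illustrative decision values for one sample: class 0 beats classes 1 and 2,
# so it collects the most votes.
D = np.array([[ 0.0,  0.7,  1.2],
              [-0.7,  0.0,  0.4],
              [-1.2, -0.4,  0.0]])
```

The zero diagonal contributes sign(0) = 0, so summing over the full row is equivalent to summing over j ≠ i.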

Numerical Experiment
A numerical experiment is carefully designed to validate the cancer diagnosis performance of l1 least square using gene expression data. The performance metric is classification accuracy obtained by 10-fold stratified cross validation. MATLAB R14 is used to implement the new method. The results are compared with the binary SVM (Vapnik, 1998) and some popular variants of multicategory SVMs, including OVR-SVM (Kressel, 1999), OVO-SVM (Kressel, 1999), DAGSVM (Platt et al., 2000), the method by Weston and Watkins (WW) (Weston and Watkins, 1999), and the method by Crammer and Singer (CS) (2000).

The results of SVMs are obtained with GEMS (Gene Expression Model Selector), a software package with a graphical user interface for classification of gene expression data, freely available at http://www.gems-system.org/. GEMS was used by Statnikov et al. (2005) for a comprehensive study of the performance of multiple classifiers on gene expression cancer diagnosis. For model selection, polynomial kernels are used with orders p = {1, 2, 3}, and the penalty parameter C = {10^{-3+0.5n}, n = 0, 1, ..., 6}.

Six datasets are used in the experiment, drawn from the eleven datasets used by Statnikov et al. (2005). They are available on the GEMS website in both GEMS and MATLAB mat-file formats. For easy comparison and reference, we adopt the names used in that reference. The six datasets are summarized below.
DLBCL (Shipp et al., 2002): The binary dataset comes from a study of gene expression of two lymphomas: diffuse large B-cell lymphomas and follicular lymphomas. Each sample contains 5469 genes. The sample number is 77.
Prostate_Tumor (Singh et al., 2002): The binary dataset contains gene expression data of prostate tumor and normal tissues. There are 10509 genes in each sample, and 102 samples.
9_Tumors (Staunton et al., 2001): The dataset comes from a study of 9 human tumor types: NSCLC, colon, breast, ovary, leukaemia, renal, melanoma, prostate, and CNS. There are 60 samples, each of which contains 5726 genes.
11_Tumors (Su et al., 2001): The dataset includes 174 samples of gene expression data of 11 various human tumor types: ovary, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, prostate, pancreas, lung adeno, and lung squamous. The number of genes is 12533.
Brain_Tumor1 (Pomeroy et al., 2002): The dataset comes from a study of 5 human brain tumor types: medulloblastoma, malignant glioma, AT/RT, normal cerebellum, and PNET, and includes 90 samples. Each sample has 5920 genes.
Brain_Tumor2 (Nutt et al., 2003): There are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, non-classic glioblastomas, and non-classic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10367.

All the datasets are normalized by rescaling the gene expression values to be between 0 and 1.
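The rescaling step can be written per gene (column) as (x - min) / (max - min). A minimal sketch, with a guard for constant genes (an implementation detail not specified in the paper):

```python
import numpy as np

def rescale(X_feat):
    """Min-max rescale each gene (column) to [0, 1]; constant genes map to 0."""
    lo, hi = X_feat.min(axis=0), X_feat.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against zero division
    return (X_feat - lo) / span
```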

Two methods are used in this experiment to study the impact of gene selection on classification performance: the Kruskal-Wallis non-parametric one-way ANOVA (KW) (Gibbons, 2003), and the ratio of between-class to within-class sums of squares (BW) (Dudoit et al., 2002).
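KW gene selection ranks each gene by how strongly its expression differs across classes. A minimal sketch using SciPy's Kruskal-Wallis test (function and variable names are ours):

```python
import numpy as np
from scipy.stats import kruskal

def kw_select(X_feat, labels, M):
    """Rank genes by Kruskal-Wallis p-value across classes; keep the M best."""
    classes = np.unique(labels)
    pvals = [kruskal(*[X_feat[labels == c, g] for c in classes]).pvalue
             for g in range(X_feat.shape[1])]
    return np.argsort(pvals)[:M]   # indices of the M most significant genes
```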


Classification without Gene Selection
Table 1 shows the classification accuracy values obtained by 10-fold stratified cross validation for both l1 least square and SVMs. The results for SVMs differ slightly from those reported by Statnikov et al. (2005), where these datasets are also used. A possible explanation is that the sample distribution for cross validation in our study differs from that in their paper.

Table 1: Performance without gene selection.

For the binary datasets Prostate_Tumor and DLBCL, the performance of l1 least square is slightly below that of SVMs. Note that the results of SVMs are obtained by careful model selection using cross validation, while our method needs no model selection and is fully automatic. In addition, just as with SVM, when applied to binary datasets the multicategory classifiers of l1 least square reduce to the binary classifier for both the OVO and OVR approaches.

When applied to the classification of multicategory datasets, OVR-l1 least square can closely match the best performance achieved by SVMs. For both SVM and l1 least square, the OVO approach performs much worse than the OVR approach on the 9_Tumors dataset.

Classification with Gene Selection
Table 2 shows the best performance achieved by OVR-l1 least square and SVMs when the gene selection methods KW and BW are used. The results show that both l1 least square and SVMs perform slightly better than without gene selection (Table 1). The improvement ranges from 0 to 9% for SVMs, but only from 0 to 3.48% for OVR-l1 least square. Again, the performance of OVR-l1 least square is comparable to that of SVMs.

Table 2: Performance with gene selection.


The success of l1 least square may lie in the sparse linear model coefficient vector obtained from l1-norm minimization. Figure 1 shows the model coefficient vector w, the l1 least square solution for classifying the binary dataset DLBCL. The sparsity suggests that the genes with greater absolute coefficients play more important roles in classification. As a result, classification performance does not depend on all the genes, especially those with very small absolute coefficients. The sparsity has the potential to greatly alleviate the curse of dimensionality and increase robustness to outliers.

Another implication of sparsity is that the genes with larger absolute coefficients may correspond to biological markers; hence, sparsity could also be used for gene selection. We carried out a small experiment to verify this possibility. The binary dataset DLBCL is used to fit the l1 least square model. Gene selection is done by choosing the M genes with the largest absolute coefficients. A binary SVM is used to classify the gene-selected data. The results are compared with the KW and BW gene selection methods. Figure 2 shows the performance of the three gene selection methods for M = 10, 20, 30, 40, and 50, respectively. The new method significantly outperforms both KW and BW when a small number of genes is selected.
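The coefficient-based selection rule amounts to sorting |w_k| and keeping the top M (excluding the bias entry). A minimal sketch with an illustrative coefficient vector:

```python
import numpy as np

def select_by_coefficient(w, M):
    """Pick the M genes with the largest |w_k|; the bias (last entry) is dropped."""
    coef = np.abs(w[:-1])
    return np.argsort(coef)[::-1][:M]   # indices of the M largest coefficients

# Illustrative sparse solution: genes 1 and 3 carry the large coefficients.
w = np.array([0.0, -2.5, 0.0, 1.1, 0.0, 0.3])
```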

Figure 1: The sparse coefficient vector.

Figure 2: The performance of the three gene selection methods.

The above gene selection approach is similar in spirit to the lasso (Tibshirani, 1996), formulated as follows
 min ||Xw - y||_2^2  subject to ||w||_1 ≤ t                        (17)

where X, w, and y follow the definitions given above for binary l1 least square, and t is the model parameter of the lasso. In addition, the lasso can also be used for classification by replacing (7) with (17) in the binary case, (10) with
 min ||Xw_k - y_k||_2^2 subject to ||w_k||_1 ≤ t_k                    (18)

for multi-category OVR approach, and (14) with
 min ||Xw_{i,j} - y_{i,j}||_2^2 subject to ||w_{i,j}||_1 ≤ t_{i,j}                 (19)
for multi-category OVO approach.
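The constrained form (17) can be illustrated directly with a generic constrained optimizer. This is only a sketch of the formulation, not one of the specialized lasso algorithms used in practice; all names and the toy data are ours:

```python
import numpy as np
from scipy.optimize import minimize

def lasso_constrained(X, y, t):
    """Eq. (17): min ||Xw - y||_2^2 subject to ||w||_1 <= t, via SLSQP."""
    p = X.shape[1]
    objective = lambda w: np.sum((X @ w - y) ** 2)
    budget = {"type": "ineq", "fun": lambda w: t - np.sum(np.abs(w))}
    res = minimize(objective, np.zeros(p), constraints=[budget])
    return res.x
```

With t large enough the l1 budget is inactive and the solution reduces to ordinary least squares; shrinking t drives coefficients toward zero, which is what links the lasso to the coefficient-based gene selection above.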

Similarly, we can also replace l1 least square regression with the Dantzig selector (Candès and Tao, 2007), which is given below for binary classification
 min ||w||_1 subject to ||X^T(y - Xw)||_∞ ≤ (1 + t^{-1}) √(2 log d) σ        (20)

where t is the model parameter, and σ is the noise standard deviation. The Dantzig selector for multicategory classification can be defined similarly.

Both the lasso and the Dantzig selector for classification, however, still need optimized model parameters chosen by a model selection procedure such as cross validation.

Conclusion

In this paper, we have described a specialized regression method for cancer diagnosis using gene expression data. The new approach, called l1 least square, casts linear regression as a constrained l1-norm minimization problem to overcome the major drawback of least square for classification: lack of robustness to outliers. Besides the binary classifier, multicategory l1 least square classifiers based on the OVO and OVR approaches are also proposed.

The numerical experiment shows that OVR-l1 least square can match the best performance achieved by SVMs with careful model selection. The main advantage of l1 least square over other methods, including SVMs, is that it needs no model selection. As a result, the method based on l1 least square is fully automatic. l1 least square also has the potential to be used for gene selection.

The l1 least square classifier may become a promising automatic cancer diagnosis tool by consistently distinguishing gene expression profile classes. The genes with large absolute regression coefficients may serve as candidate biological markers for further investigation.


References

  1. Bishop CM (2006) Pattern Recognition and Machine Learning. New York: Springer.

  2. Candès E, Romberg J, Tao T (2006) Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans on Information Theory 52: 489-509.

  3. Candès E, Tao T (2006) Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans on Information Theory 52: 5406-5425.

  4. Candès E, Romberg J (2006) l1-magic: A collection of MATLAB routines for solving the convex optimization programs central to compressive sampling [Online]. Available: www.acm.caltech.edu/l1magic/

  5. Candès E, Tao T (2007) The Dantzig selector: Statistical estimation when p is much larger than n. Ann Statist 35: 2313-2351.

  6. Chen S, Donoho D, Saunders M (2001) Atomic decomposition by basis pursuit. SIAM Rev 43: 129-159.

  7. Crammer K, Singer Y (2000) On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford University, Palo Alto, CA, June 28-July 1.

  8. Diaz-Uriarte R, Alvarez de Andres S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7: 3.

  9. Donoho D (2006) Compressed sensing. IEEE Trans on Information Theory 52: 1289-1306.

  10. Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77-87.

  11. Friedlander M, Van den Berg E (2008) SPGL1: A solver for large-scale sparse reconstruction [Online]. Available: http://www.cs.ubc.ca/labs/scl/spgl1/

  12. Gibbons JD (2003) Nonparametric Statistical Inference, 4th edition. CRC Press.

  13. Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. New York: Springer.

  14. Kressel U (1999) Pairwise classification and support vector machines. In Advances in Kernel Methods: Support Vector Learning, Chapter 15. Cambridge, MA: MIT Press.

  15. Lee JW, Lee JB, Park M, Song SH (2005) An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48: 869-885.

  16. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, et al. (2003) Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63: 1602-1607.

  17. Platt JC, Cristianini N, Shawe-Taylor J (2000) Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press.

  18. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436-442.

  19. Saunders M (2002) PDCO: Primal-dual interior method for convex objectives [Online]. Available: http://www.stanford.edu/group/SOL/software/pdco.html

  20. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med 8: 68-74.

  21. Statnikov A, Wang L, Aliferis CF (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9: 319.

  22. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, et al. (2001) Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 98: 10787-10792.

  23. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21: 631-643.

  24. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1: 203-209.

  25. Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, et al. (2001) Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61: 7388-7393.

  26. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B 58: 267-288.

  27. Andersen ED (2002) The MOSEK Optimization Tools Version 2.5: User's Manual and Reference [Online]. Available: www.mosek.com

  28. Van den Berg E, Friedlander M (2008) Probing the Pareto frontier for basis pursuit solutions. Technical Report, Department of Computer Science, University of British Columbia.

  29. Vapnik VN (1998) Statistical Learning Theory. New York: Wiley.

  30. Weston J, Watkins C (1999) Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), Bruges, April 21-23.