Research Article 

Open Access 


L_{1} Least Square for Cancer Diagnosis
using Gene Expression Data 
Xiyi Hang^{1} and FangXiang Wu^{2,3,*} 
^{1}Department of Electrical and Computer Engineering,
California State University, Northridge, CA 91330, USA 
^{2}Department of Mechanical Engineering, ^{3}Division of Biomedical Engineering, University of Saskatchewan,
Saskatoon, Saskatchewan, S7N 5A9, Canada 
^{*}Corresponding author: 
Dr. FangXiang Wu,
Division of Biomedical Engineering,
University of Saskatchewan,
Saskatoon, Saskatchewan, S7N 5A9,
Canada,
Email: xhang@csun.edu, faw341@mail.usask.ca 

Received March 19, 2009; Accepted April 27, 2009; Published April 27, 2009 
Citation: Hang X, Wu FX (2009) L1 Least Square for Cancer Diagnosis using Gene Expression Data. J Comput Sci Syst Biol 2: 167-173. doi:10.4172/jcsb.1000028 

Copyright: © 2009 Hang X, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 


Abstract 
The performance of most methods for cancer diagnosis using gene expression data greatly depends on careful model selection. Least square for classification has no need of model selection. However, a major drawback prevents its successful application to microarray data classification: lack of robustness to outliers. In this paper we cast linear regression as a constrained l_{1}-norm minimization problem to greatly alleviate its sensitivity to outliers, hence the name l_{1} least square. Numerical experiments show that l_{1} least square can match the best performance achieved by support vector machines (SVMs) with careful model selection. 
Keywords 
l_{1}-norm minimization; least square regression; classification; cancer; gene expression data; support vector machine 
Introduction 
With its high-throughput capability of simultaneously measuring the relative expression levels of tens of thousands of genes, the DNA microarray technique has the potential to provide a more accurate and objective cancer diagnosis than the traditional histopathological approach. The success, however, greatly depends upon the supervised learning algorithm selected to classify the gene expression data. 
Many well-established methods are available for gene expression profile classification. According to Lee et al. (2005), they can be classified into four categories: (1) classical methods, such as Fisher's linear discriminant analysis, logistic regression, K-nearest neighbor, and generalized partial least square; (2) classification trees and aggregation methods, such as CART, random forest, bagging, and boosting; (3) machine learning methods, such as neural networks and support vector machines (SVMs); and (4) generalized methods, such as flexible discriminant analysis, mixture discriminant analysis, and the shrunken centroid method. The performance of many methods, however, relies upon a careful choice of model parameters, which can be made via a model selection procedure such as cross validation. For example, the model parameters for SVMs include the kernel parameters
and the penalty parameter C. A recent controversy over the performance comparison between SVMs and random forests exemplifies the importance of model selection. The study by Diaz-Uriarte et al. (2006) concludes that random forest outperforms SVM, while the conclusion of Statnikov et al. (2008) is the exact opposite. The main difference between the two studies is that model selection is carefully designed in the latter but not in the former. This incident also shows that model selection may be an obstacle to the extensive application of SVMs to the classification of gene expression profiles. Since classification performance is a nonconvex function of the model parameters, it is usually difficult to find optimal model parameters by model selection. 
Least square for classification, on the other hand, has no need of model selection. Consider a general classification problem with N classes. A linear model is built for each class k: 
y_{k} = w^{T}_{k} x + w_{k0}, k = 1, 2, ..., N. (1) 

The N equations can be grouped into 

y = Wx̃ (2) 

where y = [y_{1}, y_{2}, ..., y_{N}]^{T}, W is the matrix whose kth row is [w^{T}_{k}, w_{k0}], and x̃ = [x^{T}, 1]^{T}. Consider a training dataset {(x_{i}, t_{i}), i = 1, 2, ..., n}, where t_{i} is the 1-of-N binary coding vector of the label of the ith feature x_{i}, i.e., a vector containing zeros everywhere except a 1 in the kth position if x_{i} belongs to category k. Denote by X the feature matrix whose ith row is [x^{T}_{i}, 1], and by T the target matrix whose ith row is t^{T}_{i}. The linear regression model in (2) can be fitted simultaneously to each of the columns of T, and the solution is in the form 

Ŵ = (X^{T}X)^{-1}X^{T}T (3) 

For any sample x, classification then proceeds in two steps: 
• Calculate the fitted output ŷ = Ŵ^{T}[x^{T}, 1]^{T} (an N-dimensional vector); 
• Label = arg max_{k} ŷ(k), k = 1, 2, ..., N. 

More details can be found in the literature (Bishop, 2006; Hastie et al., 2001). 
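As an illustration, the fit-and-classify procedure of (1)-(3) can be sketched in a few lines. Python is used here for illustration (the paper's implementation is in MATLAB), and the function names are ours; `lstsq` is a numerically safer stand-in for forming (X^{T}X)^{-1}X^{T}T explicitly.

```python
import numpy as np

def lstsq_classifier_fit(samples, labels, n_classes):
    """Fit the linear least-squares classifier of Eq. (3).

    samples : (n, d) feature matrix; labels : length-n integers in {0,...,N-1}.
    Returns W_hat of shape (d+1, N) acting on augmented inputs [x; 1].
    """
    n = samples.shape[0]
    Xa = np.hstack([samples, np.ones((n, 1))])   # rows [x_i^T, 1]
    T = np.eye(n_classes)[labels]                # 1-of-N target coding
    W_hat, *_ = np.linalg.lstsq(Xa, T, rcond=None)
    return W_hat

def lstsq_classifier_predict(W_hat, x):
    """Label = argmax_k of the fitted output y_hat = W_hat^T [x; 1]."""
    y_hat = np.append(x, 1.0) @ W_hat
    return int(np.argmax(y_hat))
```

A fitted model can then classify any new sample by the argmax rule above.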
The above approach, however, is very sensitive to outliers, especially in multicategory classification (N ≥ 3). Furthermore, when least square for classification is applied to gene expression data, the problem becomes more severe due to the curse of dimensionality caused by the great number of genes in each sample. 
Inspired by recent progress in sparse signal recovery via l_{1}-norm minimization (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006), we propose a new approach to overcome the major drawback of least square for classification by casting the linear regression problem as a constrained l_{1}-norm minimization problem. The obtained sparse solution is much less sensitive to both outliers and the curse of dimensionality. In addition, multicategory classification is realized via the one-versus-rest (OVR) and one-versus-one (OVO) approaches, which decompose the original multicategory problem into a series of binary problems. The new method is validated by comparing its cancer diagnosis performance with that of SVMs. 
Methods 
Binary l_{1} Least Square 
Consider a training dataset {(x_{i}, y_{i}); i = 1, ..., n}, x_{i} ∈ R^{d}, y_{i} ∈ {-1, +1}, where x_{i} represents the ith sample, a d-dimensional column vector containing gene expression values with d as the number of genes, and y_{i} is the label of the ith sample. The two classes are described by a linear model 

y = [x^{T}, 1]w (4) 

for any sample x. Applying the linear model to the training dataset, we have 

y_{i} = [x^{T}_{i}, 1]w, i = 1, 2, ..., n (5) 

The n equations can be grouped into 

y = Xw (6) 

where y = [y_{1}, y_{2}, ..., y_{n}]^{T}, and X is an n × (d + 1) matrix whose ith row is [x^{T}_{i}, 1]. Since the number of samples is much smaller than the number of genes, i.e., n << d, the system in (6) is underdetermined. The solution is obtained by casting the original problem as the following constrained l_{1}-norm minimization problem 

min ||w||_{1} subject to Xw = y (7) 
The above formulation is inspired by recent progress in compressed sensing (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006) and basis pursuit denoising (Chen et al., 2001). 
Quite a few solvers are available for the optimization problem defined in (7), such as MOSEK (Andersen, 2002), PDCO-CHOL (Saunders, 2002), PDCO-LSQR (Saunders, 2002), and l1-magic (Candès and Romberg, 2006), all of which are interior-point methods. In this study we choose a solver called SPGL1 (Friedlander and Van den Berg, 2008) for its efficiency in solving large-scale problems. Unlike the other methods, SPGL1 solves the optimization problem by converting it into a root-finding problem. Refer to Van den Berg and Friedlander (2008) for details on the theory of SPGL1. 
Denote by ŵ the solution to (7). Then for any sample x, the label can simply be assigned as sign([x^{T}, 1]ŵ). 
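A minimal sketch of this binary classifier is given below. SciPy's general-purpose linear programming routine stands in for the SPGL1 solver used in the paper, via the standard split reformulation of (7) as an LP (minimize the sum of slack variables u with -u ≤ w ≤ u and Xw = y); the function names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def l1_least_square_fit(samples, y):
    """Solve min ||w||_1 subject to Xw = y (Eq. 7) as a linear program.

    samples : (n, d) gene expression matrix with n << d; y : labels in {-1, +1}.
    """
    n, d = samples.shape
    X = np.hstack([samples, np.ones((n, 1))])      # rows [x_i^T, 1], Eq. (6)
    p = d + 1
    # Variables z = [w; u]; minimize sum(u) = ||w||_1.
    c = np.concatenate([np.zeros(p), np.ones(p)])
    A_eq = np.hstack([X, np.zeros((n, p))])        # enforce Xw = y
    I = np.eye(p)
    A_ub = np.block([[I, -I], [-I, -I]])           # w - u <= 0 and -w - u <= 0
    b_ub = np.zeros(2 * p)
    bounds = [(None, None)] * p + [(0, None)] * p  # w free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    return res.x[:p]                               # the weight vector w

def l1_ls_predict(w, x):
    """Binary label: sign([x^T, 1] w)."""
    return int(np.sign(np.append(x, 1.0) @ w))
```

Because the equality constraint Xw = y is satisfied exactly, every training sample is reproduced with its own label.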
Multicategory l_{1} Least Square: OVR 
Consider a multicategory training dataset {(x_{i}, y_{i}); i = 1, ..., n}, x_{i} ∈ R^{d}, y_{i} ∈ {1, 2, ..., N}, where N is the number of categories. The OVR approach determines for each class a binary classifier that separates it from the remaining classes. The N linear models are defined as 

D_{k}(x) = [x^{T}, 1]w_{k}, k = 1, 2, ..., N (8) 

For category k, after changing the labels of the samples belonging to class k to +1 and all others to -1, we apply the linear model to the training dataset 

y_{k} = Xw_{k}, k = 1, 2, ..., N, (9) 

where y_{k} is a label vector containing either +1 or -1. Similarly, the above N underdetermined systems can be solved by the following N constrained l_{1}-norm minimization problems 

min ||w_{k}||_{1} subject to Xw_{k} = y_{k} (10) 

where k = 1, 2, ..., N. 

Denote by ŵ_{k} the solution to (10). Then for any sample x, the label is determined by 

arg max_{k=1,2,...,N} D_{k}(x) = [x^{T}, 1]ŵ_{k}. (11) 
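The OVR decomposition of (9)-(11) can be sketched as follows. For brevity, an ordinary minimum-norm least-squares fit stands in for the l_{1} solver, so this Python sketch (function names ours) illustrates only the decomposition and decision logic, not the sparse regression itself.

```python
import numpy as np

def fit_linear(samples, y):
    """Stand-in binary fitter: minimum-norm least squares on [x^T, 1]w = y.
    (The paper solves the same system under an l1-norm objective instead.)"""
    X = np.hstack([samples, np.ones((len(samples), 1))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ovr_fit(samples, labels, n_classes):
    """One-versus-rest: one binary model per class, Eqs. (9)-(10)."""
    ws = []
    for k in range(n_classes):
        y_k = np.where(labels == k, 1.0, -1.0)   # class k vs. the rest
        ws.append(fit_linear(samples, y_k))
    return ws

def ovr_predict(ws, x):
    """Eq. (11): label = argmax_k D_k(x) = [x^T, 1] w_k."""
    xa = np.append(x, 1.0)
    return int(np.argmax([xa @ w for w in ws]))
```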
Multicategory l_{1} Least Square: OVO 
In the OVO approach, a binary classifier is constructed for each pair of classes. The linear model for class i against class j is given by 

D_{i,j}(x) = [x^{T}, 1]w_{i,j} (12) 

For the samples of categories i and j, changing their labels to +1 and -1, respectively, and applying the linear model gives rise to 

y_{i,j} = X_{i,j}w_{i,j} (13) 

where y_{i,j} is a vector containing either +1 or -1, and X_{i,j} is a matrix whose kth row is [x^{T}_{k}, 1] with x_{k} belonging to either category i or j. The underdetermined system is solved by 

min ||w_{i,j}||_{1} subject to X_{i,j}w_{i,j} = y_{i,j} (14) 

Since D_{j,i} = -D_{i,j}, the number of classifiers is (N choose 2), i.e., N(N - 1)/2, compared to N in the OVR approach. 

Denote by ŵ_{i,j} the solution to (14). For any sample x, we calculate 

D_{i}(x) = Σ^{N}_{j=1, j≠i} sign(D_{i,j}(x)) (15) 

with D_{i,j}(x) = [x^{T}, 1]ŵ_{i,j}. The label of x is determined by 

arg max_{i=1,2,...,N} D_{i}(x) (16) 
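The OVO pairwise-voting scheme of (13)-(16) can be sketched as below. As in the OVR sketch, a least-squares fit (function names ours) stands in for the l_{1} solver; only the pairing and voting logic of the paper is illustrated.

```python
import numpy as np
from itertools import combinations

def fit_linear(samples, y):
    """Stand-in binary fitter (least squares; the paper uses l1 minimization)."""
    X = np.hstack([samples, np.ones((len(samples), 1))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ovo_fit(samples, labels, n_classes):
    """One pairwise classifier per class pair, Eqs. (13)-(14): N(N-1)/2 models."""
    models = {}
    for i, j in combinations(range(n_classes), 2):
        mask = (labels == i) | (labels == j)
        y_ij = np.where(labels[mask] == i, 1.0, -1.0)
        models[(i, j)] = fit_linear(samples[mask], y_ij)
    return models

def ovo_predict(models, x, n_classes):
    """Eqs. (15)-(16): each pairwise sign contributes one vote."""
    xa = np.append(x, 1.0)
    votes = np.zeros(n_classes)
    for (i, j), w in models.items():
        if xa @ w >= 0:
            votes[i] += 1        # D_{i,j}(x) >= 0: vote for class i
        else:
            votes[j] += 1        # D_{j,i} = -D_{i,j}: vote for class j
    return int(np.argmax(votes))
```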
Numerical Experiment 
A numerical experiment is carefully designed to validate the cancer diagnosis performance of l_{1} least square using gene expression data. The performance metric is the classification accuracy obtained by 10-fold stratified cross validation. MATLAB R14 is used to implement the new method. The results are compared with binary SVM (Vapnik, 1998) and some popular variants of multicategory SVMs, including OVR-SVM (Kressel, 1999), OVO-SVM (Kressel, 1999), DAGSVM (Platt et al., 2000), the method by Weston and Watkins (WW) (Weston and Watkins, 1999), and the method by Crammer and Singer (CS) (2000). 
The results of the SVMs are obtained from GEMS (Gene Expression Model Selector), a software package with a graphical user interface for the classification of gene expression data. It is freely available at http://www.gemssystem.org/. GEMS was used by Statnikov et al. (2005) for a comprehensive study of the performance of multiple classifiers on gene-expression-based cancer diagnosis. As for model selection, polynomial kernels are used with orders p = {1, 2, 3}, and the penalty parameter C = {10^{-3+0.5n}, n = 0, 1, ..., 6}. 
Six datasets are used in the experiment, which are among the eleven datasets used in reference (Statnikov et al., 2005). They are available on the GEMS website in both GEMS and MATLAB mat-file formats. For easy comparison and reference, we adopt the names used in reference (Statnikov et al., 2005). The information about the six datasets is summarized below. 
• DLBCL (Shipp et al., 2002): The binary dataset comes from a study of gene expression of two lymphomas: diffuse large B-cell lymphoma and follicular lymphoma. Each sample contains 5469 genes. The sample number is 77. 

• Prostate_Tumor (Singh et al., 2002): The binary dataset contains gene expression data of prostate tumor and normal tissues. There are 10509 genes in each sample, and 102 samples. 

• 9_Tumors (Staunton et al., 2001): The dataset comes from a study of 9 human tumor types: NSCLC, colon, breast, ovary, leukaemia, renal, melanoma, prostate, and CNS. There are 60 samples, each of which contains 5726 genes. 

• 11_Tumors (Su et al., 2001): The dataset includes 174 samples of gene expression data of 11 human tumor types: ovary, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, prostate, pancreas, lung adeno, and lung squamous. The number of genes is 12533. 

• Brain_Tumor1 (Pomeroy et al., 2002): The dataset comes from a study of 5 human brain tumor types: medulloblastoma, malignant glioma, AT/RT, normal cerebellum, and PNET, including 90 samples. Each sample has 5920 genes. 

• Brain_Tumor2 (Nutt et al., 2003): There are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, non-classic glioblastomas, and non-classic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10367. 

All the datasets are normalized by rescaling the gene expression values to be between 0 and 1. 
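A per-gene min-max rescaling consistent with this normalization might look like the sketch below. The paper does not state whether rescaling is done per gene or globally; per gene (column-wise) is assumed here.

```python
import numpy as np

def rescale_01(X):
    """Rescale each gene (column) of X linearly into [0, 1].

    X : (n_samples, n_genes) expression matrix.
    Genes that are constant across all samples are mapped to 0.
    """
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)   # avoid division by zero
    return (X - lo) / span
```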
Two methods are used in this experiment to study the impact of gene selection on classification performance: the Kruskal-Wallis nonparametric one-way ANOVA (KW) (Gibbons, 2003), and the ratio of between-class to within-class sums of squares (BW) (Dudoit et al., 2002). 
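For reference, the BW statistic of Dudoit et al. (2002) can be computed per gene as sketched below (our vectorization; larger ratios indicate genes that discriminate the classes better).

```python
import numpy as np

def bw_ratio(X, labels):
    """Between-class over within-class sums of squares for each gene.

    X : (n_samples, n_genes) expression matrix; labels : length-n class labels.
    """
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)                      # class centroid
        between += len(Xk) * (ck - overall) ** 2  # between-class scatter
        within += ((Xk - ck) ** 2).sum(axis=0)    # within-class scatter
    return between / within
```

Genes are then ranked by this ratio and the top-scoring ones retained.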
Results 
Classification without Gene Selection 
Table 1 shows the classification accuracy values obtained by 10-fold stratified cross validation for both l_{1} least square and SVMs. The results for the SVMs are slightly different from those reported by Statnikov et al. (2005), where the six datasets are also used. A possible explanation is that the partitioning of the data for cross validation in our study is different from that in paper (Statnikov et al., 2005). 
Table 1: Performance without gene selection.


For the binary datasets Prostate_Tumor and DLBCL, the performance of l_{1} least square is slightly below that of SVMs. Note that the results of the SVMs are obtained by careful model selection using cross validation, while our method needs no model selection and is fully automatic. In addition, just as with SVM, when applied to binary datasets the multicategory classifiers of l_{1} least square reduce to the binary classifier for both the OVO and OVR approaches. 
When applied to the classification of multicategory datasets, OVR l_{1} least square can closely match the best performance achieved by SVMs. For both SVM and l_{1} least square, the OVO approach performs much worse than the OVR approach on the 9_Tumors dataset. 
Classification with Gene Selection 
Table 2 shows the best performance achieved by OVR l_{1} least square and SVMs when the gene selection methods KW and BW are used. The results show that both l_{1} least square and SVMs perform slightly better than without gene selection (Table 1). The improvement ranges from 0 to 9% for the SVMs, but only from 0 to 3.48% for OVR l_{1} least square. Again, the performance of OVR l_{1} least square is comparable to that of SVMs. 
Table 2: Performance with gene selection.


Discussion 
The success of l_{1} least square may lie in the sparse linear model coefficient vector obtained from l_{1}-norm minimization. Figure 1 shows the model coefficient vector w, which is the solution of l_{1} least square for classifying the binary dataset DLBCL. The sparsity suggests that the genes with greater absolute coefficients may play more important roles in classification. As a result, the classification performance does not depend on all the genes, especially those with very small absolute coefficients. The sparsity has the potential to greatly alleviate the curse of dimensionality and increase the robustness to outliers. 
Another implication of sparsity is that the genes with larger absolute coefficients may correspond to biological markers. Hence, sparsity could also be used for gene selection. We conducted a small experiment to verify this possibility. The binary dataset DLBCL is used to fit the l_{1} least square model. Gene selection is done by choosing the M genes with the M largest absolute coefficients. Binary SVM is used to classify the gene-selected data. The results are compared with the KW and BW methods for gene selection. Figure 2 shows the performance of the three gene selection methods for M = 10, 20, 30, 40, and 50. The new method significantly outperforms both the KW and BW methods when a small number of genes is selected. 
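The coefficient-based selection step described above amounts to ranking genes by the absolute value of their fitted weights. A sketch (assuming the bias term is stored last in the coefficient vector, following the [x^{T}, 1]w convention used throughout):

```python
import numpy as np

def select_genes_by_l1_weights(w, M):
    """Indices of the M genes with the largest absolute model coefficients.

    w : fitted coefficient vector [w_genes; bias]; the trailing bias entry
    is excluded before ranking.
    """
    coef = np.abs(w[:-1])                 # drop the bias term
    return np.argsort(coef)[::-1][:M]     # indices sorted by |coefficient|, descending
```

The selected columns of the expression matrix are then passed to the downstream classifier (binary SVM in the experiment above).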

Figure 1: The sparse coefficient vector.



Figure 2: The performance of the three gene selection methods.


The above gene selection approach is similar in spirit to the lasso (Tibshirani, 1996), formulated as follows: 

min ||Xw - y||_{2}^{2} subject to ||w||_{1} ≤ t (17) 

where X, w, and y follow the definitions given in the section on binary l_{1} least square, and t is the model parameter of the lasso. In addition, the lasso can also be used for classification by replacing (7) with (17) in the binary case, (10) with 

min ||Xw_{k} - y_{k}||_{2}^{2} subject to ||w_{k}||_{1} ≤ t_{k} (18) 

for the multicategory OVR approach, and (14) with 

min ||X_{i,j}w_{i,j} - y_{i,j}||_{2}^{2} subject to ||w_{i,j}||_{1} ≤ t_{i,j} (19) 

for the multicategory OVO approach. 
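Constrained problems such as (17)-(19) are commonly solved in the equivalent penalized (Lagrangian) form min ||Xw - y||_{2}^{2} + λ||w||_{1}, where each bound t corresponds to some λ. A simple iterative soft-thresholding (ISTA) sketch for that penalized form is given below; this is a generic solver sketch for illustration, not the procedure used in the paper.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by iterative soft-thresholding.

    Each gradient step on the quadratic term is followed by the proximal
    (soft-thresholding) step for the l1 penalty.
    """
    L = 2.0 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)           # gradient of ||Xw - y||_2^2
        z = w - grad / L                          # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return w
```

Larger λ drives more coefficients exactly to zero, mirroring the sparsity discussed above.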
Similarly, we can also replace l_{1} least square regression with the Dantzig selector (Candès and Tao, 2007), which is given below for binary classification: 

min ||w||_{1} subject to ||X^{T}(y - Xw)||_{∞} ≤ (1 + t^{-1})√(2 log d)·σ (20) 

where t is the model parameter and σ is the noise standard deviation. The Dantzig selector for multicategory classification can be defined similarly. 
Both the lasso and the Dantzig selector for classification, however, still need optimized model parameters chosen by a model selection procedure such as cross validation. 
Conclusion 
In this paper, we have described a specialized regression method for cancer diagnosis using gene expression data. The new approach, called l_{1} least square, casts linear regression as a constrained l_{1}-norm minimization problem to overcome the major drawback of least square for classification: lack of robustness to outliers. Besides the binary classifier, multicategory l_{1} least square classifiers based on the OVO and OVR approaches are also proposed. 
Numerical experiments show that OVR l_{1} least square can match the best performance achieved by SVMs with careful model selection. The main advantage of l_{1} least square over other methods, including SVMs, is that it has no need of model selection. As a result, the method based on l_{1} least square is fully automatic. l_{1} least square also has the potential to be used for gene selection. 

The l_{1} least square classifier may become a promising automatic cancer diagnosis tool by consistently distinguishing gene profile classes. The genes with large absolute regression coefficients may serve as biological marker candidates for further investigation. 
References 
• Bishop CM (2006) Pattern Recognition and Machine Learning. New York: Springer.
• Candès E, Romberg J, Tao T (2006) Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 52: 489-509.
• Candès E, Tao T (2006) Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans Inf Theory 52: 5406-5425.
• Candès E, Romberg J (2006) l1-magic: A collection of MATLAB routines for solving the convex optimization programs central to compressive sampling [Online]. Available: www.acm.caltech.edu/l1magic/
• Candès E, Tao T (2007) The Dantzig selector: Statistical estimation when p is much larger than n. Ann Statist 35: 2313-2351.
• Chen S, Donoho D, Saunders M (2001) Atomic decomposition by basis pursuit. SIAM Rev 43: 129-159.
• Crammer K, Singer Y (2000) On the learnability and design of output codes for multiclass problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford University, Palo Alto, CA, June 28-July 1.
• Diaz-Uriarte R, Alvarez de Andres S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7: 3.
• Donoho D (2006) Compressed sensing. IEEE Trans Inf Theory 52: 1289-1306.
• Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77-87.
• Friedlander M, Van den Berg E (2008) SPGL1, a solver for large-scale sparse reconstruction [Online]. Available: http://www.cs.ubc.ca/labs/scl/spgl1/
• Gibbons JD (2003) Nonparametric Statistical Inference, 4th edition. CRC.
• Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. New York: Springer.
• Kressel U (1999) Pairwise classification and support vector machines. In Advances in Kernel Methods: Support Vector Learning, Chapter 15. Cambridge, MA: MIT Press.
• Lee JW, Lee JB, Park M, Song SH (2005) An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48: 869-885.
• Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, et al. (2003) Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63: 1602-1607.
• Platt JC, Cristianini N, Shawe-Taylor J (2000) Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press.
• Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436-442.
• Saunders M (2002) PDCO: Primal-dual interior method for convex objectives [Online]. Available: http://www.stanford.edu/group/SOL/software/pdco.html
• Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med 8: 68-74.
• Statnikov A, Wang L, Aliferis CF (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9: 319.
• Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, et al. (2001) Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 98: 10787-10792.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21: 631-643.
• Singh D, Febbo PG, Ross K, Jackson DG, Manola J, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 203-209.
• Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, et al. (2001) Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61: 7388-7393.
• Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B 58: 267-288.
• The MOSEK Optimization Tools Version 2.5. User's Manual and Reference (2002) [Online]. Available: www.mosek.com
• Van den Berg E, Friedlander M (2008) Probing the Pareto frontier for basis pursuit solutions. Technical Report, Department of Computer Science, University of British Columbia.
• Vapnik VN (1998) Statistical Learning Theory. New York: Wiley.
• Weston J, Watkins C (1999) Support vector machines for multiclass pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), Bruges, April 21-23.

