rm(list=ls())

expression = runif(20)

expression = expression[order(expression,decreasing = T)]

types = c(rep(1,times=10),rep(2,times=10))

data = data.frame(expression=expression,types=types)

names = paste(‘exon_’,1:20,sep=””)

rownames(data)=names

barplot(data$expression,col = c(“deepskyblue”,”gray”)[as.factor(data$types)],horiz=F,border = F,las=2,names=rownames(data),xlab=”Splicing genes”,ylab=”conservation scores”,cex.names = 0.6,cex.lab = 1.5)

# An example script of creating a heatmap using the heatmap.2 function in R

# Just copy all below into R to test

# Notes: the** styles of “double quotes”** when copied may change and make the script not run normally, but you can manually change the “double quotes”.

rm(list=ls())

require(gplots)

X=matrix(rnorm(120),nrow=20)

breaks = 20 # number of colors

my_palette <- colorRampPalette(c(“green”,”red”))(n =breaks-1) # self-define and discretize color maps

heatmap.2(X,Colv = NA,col=my_palette,breaks = breaks,density.info=”none”,trace = “none”,scale=”none”)

# Venn diagram in R

# personal notes

library(VennDiagram)

data=list(A=c(1,2,3,4,5,6),B=c(3,4,5,6,7),C=c(5,6,7,8,9,10)) # demo data

venn.diagram(data,”vennFigure.tiff”,fill = c(“cornflowerblue”, “green”, “yellow”)) # venn diagram saved to a file

# File handle arrays in PERL

# for personal notes

@group=(0,1,2,3,4);

# generate files and file handles

foreach $i (@group){

$train[$i]=”train_$i.txt”; # file name

$FILE[$i]=”FILE”.”$i”; # file handle

open $FILE[$i],”>$train[$i]” or die;

}

# write to file

foreach $i (@group){

print {$FILE[$i]} “test\n”;

close $FILE[$i];

}

# convert between RefSeq, Entrez and Ensembl gene IDs using R package biomaRt

# R code for convert gene/transcripts names

# personal notes

require(biomaRt)

# settings

features=c(“ensembl_gene_id”,”ensembl_transcript_id”,”entrezgene”,”external_gene_name”,”refseq_mrna”)

host=”www.ensembl.org”

# human

human_file=”geneID_mapping_human.tsv”

hmart=useMart(“ENSEMBL_MART_ENSEMBL”, dataset=”hsapiens_gene_ensembl”,host=host)

results=getBM(attributes=features,mart=hmart)

write.table(results,human_file,sep=”\t”,quote=FALSE,row.names=FALSE,col.names=TRUE)

# mouse

mouse_file=”geneID_mapping_mouse.tsv”

mmart=useMart(“ENSEMBL_MART_ENSEMBL”, dataset=”mmusculus_gene_ensembl”,host=host)

results=getBM(attributes=features,mart=mmart)

write.table(results,mouse_file,sep=”\t”,quote=FALSE,row.names=FALSE,col.names=TRUE)

# Partial Least Squares (PLS) code and basics

There are various ways to implement PLS, including the NIPALS, SIMPLS and the bi-diagonalizaton method of Rolf Manne. Here I provide a code based Wold’s 2001 paper: Chemometr. Intell. Lab. 58(2001)109-130.

**Input**:

**X**: data matrix of size n x p

**Y**: response variable of size n x 1

A: the number of PLS components to extract, which is usually optimized by cross validation.

**Output:**

**B**: a p-dimensional regression vector, where p equals the number of columns in **X**. If you want to add an intercept in your model, just add an additional column of ones to **X**.

**T**: PLS component or score matrix of size n x A. Can be thought of dimension-reduced representation of **X**. Similar to principal components in PCA but obtained in a different way.

**Wstar: [Wstar1, Wstar2,…,WstarA],** weight matrix to calculate **T** from original input **X**. Mathematically, **T=XWstar**.

**W**: **[W1, W2,…,WA], **weight matrix to calculate **T** from the *residual- X* at each iteration. Note that

**W**is different from

**Wstar**in addition to

**W1=Wstar1**.

**P**: Loading matrix. **X=TP’+E**

**R2X:** a A-dimensional vector, records the explained variance of **X** by each PLS component

**R2Y**: a A-dimensional vector, records the explained variance of **Y** by each PLS component

**Code**: copy the whole below and save as a function.

function [B,Wstar,T,P,Q,W,R2X,R2Y]=pls_basic(X,Y,A)

%+++ The NIPALS algorithm for both PLS-1 (a single y) and PLS-2 (multiple Y)

%+++ The model is assumed to be: Y=XB+E,where E is random errors.

%+++ X: n x p matrix

%+++ Y: n x m matrix

%+++ A: number of latent variables

%+++ Code: Hongdong Li, lhdcsu@gmail.com, Feb, 2014

%+++ reference: Wold, S., M. Sj?str?m, and L. Eriksson, 2001. PLS-regression: a basic tool of chemometrics,

% Chemometr. Intell. Lab. 58(2001)109-130.

varX=sum(sum(X.^2));

varY=sum(sum(Y.^2));

for i=1:A

error=1;

u=Y(:,1);

niter=0;

while (error>1e-8 && niter<1000) % for convergence test

w=X’*u/(u’*u);

w=w/norm(w);

t=X*w;

q=Y’*t/(t’*t); % regress Y against t;

u1=Y*q/(q’*q);

error=norm(u1-u)/norm(u);

u=u1;

niter=niter+1;

end

p=X’*t/(t’*t);

X=X-t*p’;

Y=Y-t*q’;

%+++ store

W(:,i)=w;

T(:,i)=t;

P(:,i)=p;

Q(:,i)=q;

end

%+++ calculate explained variance

R2X=diag(T’*T*P’*P)/varX;

R2Y=diag(T’*T*Q’*Q)/varY;

Wstar=W*(P’*W)^(-1);

B=Wstar*Q’;

Q=Q’;

%+++

# Random Frog and results interpretation

Random Frog is a variable selection algorithm that computationally assign a probability to each variable as a measure of its IMPORTANCE in a predictive model.

CARS, Competitive Adaptive Reweighted Sampling, is a variable or feature selection methods, especially suitable for p>>n setting and data with high correlation (coupled with PLS). Here I will give an example to detail how to use CARS and the meaning of its output.

Suppose that you have MATLAB installed (I am using 2010a version on Mac), downloaded the libPLS_1.95 and have added libPLS_1.95 in the search path.

First, let’s get our data. Copy the codes below to the command window and press ENTER:

*%+++ load data*

*load(‘corn_m51.mat’);*

*[n,p]=size(X);*

Keep in mind that this data has n=80 samples which were measured on p=700 wavelengths. The target variable to model is stored in vector **y**. In this p>>n setting, ordinary least squares (OLS) can not be used to build a model **y=Xb**. Why? because **b** is calculated as **b=inv(X’*X)X’y **(inv indicates inverse, ‘ indicates transpose) in OLS. Due to p > n, **X’*X **is not full rank and hence do NOT have its inverse and therefore OLS fails in this situation. One type of solution is to use regularized methods such as ridge regression or latent variable-based methods such as principal component regression (PCR) and partial least squares (PLS). Why not simply reduce p? A good idea! This motivates another type of solution which is the well known branch of statistics: variable selection or called feature selection.

Then, we split the data into a training and a test set, which will be used for model training and validation, respectively. There are different ways to split data, such as the KS (Kennard-Stone) method and random partition. Here, I will randomly split the data using the codes below:

%+++ split data into a training and test set

perm=randperm(n);

ktrain=perm(1:60); % 60 samples as training

ktest=perm(61:80); % 20 samples as test

Xtrain=X(ktrain,:);

Xtest=X(ktest,:);

ytrain=y(ktrain);

ytest=y(ktest);

At this point, data is ready! We can start to select a subset of variables using the CARS method. One may ask why not use the full spectral data? A good question to think about. Run the codes below to select variables (wavelengths):

%+++ perform variable selection using only the training data

A=10; % maximum number of PLS components

fold=5; % fold for cross validation

method=’center’; % for internal pretreatment

num=50; % number of iterations of CARS, 50 is default

CARS=carspls(Xtrain,ytrain,A,fold,method,num,0,0);

To see the output of CARS, type “CARS” in command window and press ENTER and you will see:

>> CARS

CARS =

W: [700×50 double]

time: 2.8165

RMSECV: [1×50 double]

RMSECV_min: 3.0894e-04

Q2_max: 1.0000

iterOPT: 49

optLV: 2

ratio: [1×50 double]

vsel: [405 505]

Technically, CARS is a structural data with different components like optLV as shown above. Your can use CARS.RMSECV to fetch the value of RMSECV. The meaning of each component of CARS is given here:

W: [700×50 double] % weight of each variable at each iteration.

time: 2.8165

RMSECV: [1×50 double] % Root Mean Squares Errors of Cross Validation, one for each iteration

RMSECV_min: 3.0894e-04 % the minimum of RMSECV

Q2_max: 1.0000 % Q2 is the cross validated R2.

iterOPT: 49 % the optimal iteration number (corresponding to the RMSECV_min)

optLV: 2 % the optimal number of Latent Variables of PLS model at iterOPT.

ratio: [1×50 double]

vsel: [405 505] % The selected variables. THIS IS WHAT WE WANT.

Recall that this data contains p=700 variables. Here only two out of the 700 were selected to be useful by CARS: the 405th and the 505th. To understand the behavior of CARS, let’s issue the following command:

plotcars(CARS)

which gives you a plot:

In Figure 1, the upper panel shows the number of variables kept by CARS at each iteration; the middle panel plots the the RMSECV at each iteration, which shows that RMSECV decreases with reduced number of variables (interesting? why?); the bottom panel provides the regression coefficients of each of the 700 variables at each iteration, called Regression Coefficient Paths (RCP) where each line denotes the RCP of one variable. Eliminated variables will have RCP ending with zero (thus you can not see it); some variables will have their RCP first increase and then decrease and finally drop to zero (notice iterations around 30); some variables survived along the whole 50 iterations of CARS and are** winners** and are selected by CARS. Notice that there two lines (just the 405th and 505th variables) clearly stand out!

We have selected 2 variables. Then let’s check out how only these two will perform. The variable-reduced data is:

Xtrain_sub=Xtrain(:,[405 505]);

Xtest_sub=Xtest(:,[405 505]);

The we build a model and test it:

model=pls(Xtrain_sub,ytrain,2);

ypred=plsval(model,Xtest_sub,ytest)

An intuitive way to check model performance is to plot predicted value against the experimental value:

plot(ytest, ypred,’b.’);

xlabel(‘Experimental’);

ylabel(‘Predicted’);

This gives you a plot:

You want to give a quantitative measurement of model performance. Usually, you calculate RMSEP (Root Mean Squared Errors of Prediction):

RMSEP=sqrt(sumsqr(ypred-ytest)/length(ytest));

which is 3.0285e-04 in this case, very small (this data is very special).

So far, we have selected a subset of variables (405 and 505) using CARS and tested the performance of this subset on an independent test set. You may ask whether using a subset will improve model performance over the full-spectral model? It will as long as the full-model contains information that is irrelevant to the target variable you are predicting. As a test case, you can build a model using the full training data and tested it on the full test set and compares it to the simpler model with only two selected variables.