Identification of ANO6 as a Novel Prognostic Biomarker for Breast Cancer

Identification of ANO6 as a Novel Prognostic Biomarker for Breast Cancer

1. Introduction

Breast cancer (BC) is the most prevalent cancer worldwide amongst women, with a 5-year survival rate of 90% and a decreasing mortality rate [1]. Some of its risk factors include age, obesity, hormone use and a mutation in the BRCA1 and BRCA2 tumour suppressor genes [2]. The BC subtypes are classified based on the oestrogen receptor, progesterone receptor, and human epidermal growth factor receptor-2 status; they are also very important prognostic markers for BC [3] as there is a clear difference between the survival rates of the subtypes, even within different stages [4]. The aim of this study is to identify a novel prognostic gene, associated with survival, for BC patients using the publicly available TCGA dataset.

2. Methods

2.1 Data

The TCGA BC datasets analysed in this study were downloaded from Xena Browser; this included clinical, gene expression (RNA-sequencing), DNA methylation (DNAm), mutation, and protein expression data. The clinical data consists of the survival data, age, factors related to the donors, like hormone receptor statuses, for 1,236 donors. The gene expression dataset was experimentally measured via the Illumina HiSeq 2000 RNA Sequencing platform, and it was measured for 20,530 genes for 1,218 donors. The DNAm dataset was measured experimentally using the Illumina Infinium HumanMethylation450 platform and was represented using beta values. The gene mapping dataset needed to annotate the DNAm dataset was also downloaded from Xena Browser. This methylation beta values were measured for 485,577 probes (CpG sites) and for 888 donors. The mutation dataset was consisted of a binary variable for each gene, were 1 indicated a non-silent mutation and 0 a wild type. The dataset had the mutation states of 40,542 genes for 791 donors. The protein expression data, which was measure by reverse phase protein array (RPPA) technology, consisted of 131 proteins for 747 patients. Prior to analysing, the data was pre-processed to ensure homogeneity of donor IDs, as well as improve the quality of data by removing NAs, where applicable.

2.2 Identification of the prognostic gene & survival analysis

In order to identify the gene of interest, the gene expression (RNA-Seq) dataset was used. Initially, the dataset was pre-processed to check for any missing values (NAs) and genes which had an expression level of 0; none were found so the dataset was ready to be analysed. The next step was to align the RNA-Seq and survival’s donors, this resulted to datasets made up of 1,215 donors. Univariate survival analysis was carried out on the survival data to identify factors affecting overall survival, this was done via Cox proportional hazards (CoxPH) models. Their significance and affect were assessed using the hazard ratio (HR), p-values, and confidence intervals (CI). The HR is defined as the exponential of the coefficient of a variable. The significance level for the p-values was set at 0.05. For the univariate CoxPH models, the p-value was calculated using the Wald test, and that p-value was the one indicating the model’s significance. While for the multivariate CoxPH models, the model’s overall significance was assessed using the p-value of the Likelihood ratio test, and each variable’s p-value was calculated using the Wald test.

Three ‘for loops’ were created to extract the HR, 95% CIs, and p-values of the CoxPH models adjusted by the expression value of each gene. The first CoxPH model was adjusted only for the level of gene expression (a continuous variable), the second one for gene expression level and age, and the last model was adjusted for gene expression level, age and high stage. The high stage, a binary variable, was assigned a value of 1 if a donor was Stage 3 or Stage 4, else a 0. The p-values of each data frame were adjusted using the false discovery rate (FDR) approach to correct for multiple testing [5].

The gene which was most significant and not already associated with BC was chosen. Subsequently, a new variable was created for the expression level of that gene which indicated if a donor had a high or low expression level, this was done by splitting on the median. Using that binary variable, a Kaplan-Meier survival curve was generated, and its significance was assessed based on the p-value of the Log-rank test. The survival analysis was carried out using the survival R package.

2.3 Methylation Data

To understand the molecular mechanism of our chosen prognostic gene, the DNAm data was analysed corresponding to that specific gene. Using the Illumina Methylation450k annotation file, 40 probes were found associated with the candidate gene, which subsequently the DNAm dataset was filtered for. The DNAm dataset was then pre-processed to delete the probes that had more than 50% missing values (NAs); this then resulted in 28 associated probes. The DNAm dataset was then filtered for the overlapping donors with the gene expression dataset, and this yielded 870 donors. To investigate the relationship between the gene expression level and the associated probes a Spearman correlation test was carried out for each probe. A Spearman test was chosen over the Pearson test as the methylation data is not normally distributed. Their association was assessed using the p-value of the test and the correlation coefficient (Spearman ρ). To further analyse the methylation probes correlated with the gene expression levels, boxplots were generated by splitting them based on whether the probes had a positive or a negative correlation.

2.4 Mutation Analysis

The binary mutation dataset was filtered for the candidate gene, and for the donors overlapping with those of the gene expression dataset, this resulted to mutation data for 789 donors. Then, a Fisher’s exact test was performed on a two-by-two contingency table, indicating if a donor had a high or low gene expression and whether that candidate gene was mutated or not. The odds ratio (OR), 95% CI and p-value were calculated. Next, the effect of the mutation variable was assessed using a CoxPH model; HR, 95% CI and the p-value was calculated using the Wald test, and for the model’s overall significance using the Likelihood ratio test.

2.5 Protein Analysis

The RPPA dataset was used to identify which proteins correlate with the selected candidate gene, this was done via a linear model. The log2-fold-change, average log2-expression, moderated t-statistic, p-values and adjusted p-values were calculated. The limma R package was used to fit the model.

All the statistical analyses, data processing, and graphs in this study were performed in R (version 4.0.3) [6].

3. Results

3.1 Identification of ANO6 as the novel prognostic gene and its effect on survival of BRCA patients

Using the univariate CoxPH models to assess which factors affect OS of BRCA patients, the following were found significant: age (HR = 1.03, 95% CI: 1.02 – 1.04, p = 4.23E-8) and high stage (HR = 2.22, 95% CI: 1.65 – 2.97, p = 1.01E-7). Using those, the three ‘for loops’ were executed; the first one only adjusted by gene expression level, the second one by gene expression level and age, and the third one by gene expression level, age and high stage. From the univariate analysis, ANO6 (HR = 1.60, 95% CI: 1.34 - 1.88, p = 0.000175) was found to be the most significant. From the second model, ANO6 (HR = 1.55, 95% CI: 1.31 – 1.83, p = 0.0012) was third, however, the first two most significant (LOC148145, SIAH2) are already associated with BC. When also adjusting for high stage, ANO6 (HR = 1.54, 95% CI: 1.29 – 1.83, p = 0.0010) was the 17th most significant, and it was the first gene that is not already associated with BC. It should be noted that the p-values corresponding to the CoxPH models for ANO6 are the adjusted p-values.

image **_Figure 1 Kaplan-Meier plot of overall survival of BC patients stratified by ANO6 expression_**

The hazard ratio (HR) and 95% confidence interval (CI) of the survival curve is indicated on the plot. The p-value has been calculated using the log-rank test. The black colour indicates that the donor has a low ANO6 expression (n = 608), and the red colour a donor with a high ANO6 expression (n = 607). The high and low values were assigned by splitting on the median.


From the survival curve, in Figure 1, it is evident that a lower expression of ANO6 leads to better survival rates. The HR of 2.18 indicates that if a donor were to have a high expression, then they would have a 118% increase in risk of mortality.

3.2 Correlation between ANO6 expression and ANO6 methylation levels

After pre-processing the DNAm dataset there were 28 probes associated with ANO6. The Spearman correlation test showed that 20 out of the 28 probes had a significant (p < 0.05) correlation, see Table 1. From the 20 significant probes, 14 (70%) had a negative correlation, while 6 (30%) had a positive correlation, however for all of them, Spearman ρ shows a weak correlation. Thus, due to there being both a positive and negative correlation between the ANO6 gene expression and its methylation probes, this indicates that methylation is not likely to be the underlying molecular mechanism making ANO6 a possible biomarker.

| Probe | Spearman ρ | P-value | |---|---|---| | cg22441683 | 0.268 | 8.30e-16 | | cg01790771 | -0.266 | 1.76e-15 | | cg19326153 | -0.235 | 2.06e-12 | | cg17052212 | -0.228 | 9.13e-12 | | cg26332194 | -0.208 | 5.34e-10 | | cg21185662 | -0.204 | 1.21e-09 | | cg03162359 | 0.200 | 2.44e-09 | | cg25162927 | 0.198 | 3.78e-09 | | cg23336996 | -0.195 | 6.30e-09 | | cg12686055 | -0.163 | 1.43e-06 | | cg23310538 | -0.149 | 9.99e-06 | | cg19893032 | 0.141 | 2.97e-05 | | cg02045493 | -0.130 | 1.27e-04 | | cg23255486 | -0.126 | 1.96e-04 | | cg24949344 | 0.122 | 3.01e-04 | | cg17810878 | -0.114 | 7.48e-04 | | cg10892950 | -0.099 | 3.49e-03 | | cg07905273 | -0.085 | 1.17e-02 | | cg18470038 | 0.078 | 2.25e-02 | | cg13582818 | -0.071 | 3.51e-02 | | cg20037773 | 0.065 | 5.36e-02 | | cg26430465 | -0.059 | 8.19e-02 | | cg22251684 | -0.059 | 8.22e-02 | | cg21705221 | -0.021 | 5.42e-01 | | cg20883194 | -0.015 | 6.62e-01 | | cg07054062 | -0.013 | 7.07e-01 | | cg09909775 | -0.009 | 7.87e-01 | | cg10609068 | -0.002 | 9.51e-01 | **_Table 1 Spearman correlation coefficient and p-values between ANO6 expression and ANO6 methylation probes_**

The table has the Spearman ρ, which indicates the correlation coefficient, and the p-value of the correlation test. Those highlighted in blue indicate the significant probes. They are ordered from the most significant to the least significant based on the p-values.

image

Figure 2 Boxplot of Methylation Levels of the associated ANO6 probes correlated with its gene expression (A) Probes which are positively correlated with ANO6 gene expression (B) Probes which are negatively correlated with ANO6 gene expression

From the boxplots it is evident that a positive or negative correlation does not imply a certain level of methylation, i.e., hypomethylation or hypermethylation. However, it is evident that for the positively correlated (Figure 2A), the methylation levels of those probes tend to be at the extrema with a median methylation of less than 0.1 or greater than 0.87, if hypomethylated or hypermethylated, respectively. While, for the negatively correlated probes (Figure 2B), 6 (~ 43%) are hypomethylated with a median beta value below 0.1. This variation, in the beta values, does not provide a clear answer between the relationship of gene expression and methylation levels.

3.3 Association between ANO6 expression levels and ANO6 mutation state

After filtering for the overlapping donors for the mutation states of the ANO6 gene, out of the 789 donors only 7 had a mutation present. When performing the Fisher’s exact test on the two-by-two contingency table (of high versus low expression and mutated or not) there were no significant findings (OR = 0.85, 95% CI: 0.12 – 5.08, p = 1).

From the CoxPH model stratified by mutation state (HR = 3.65, 95% CI: 0.50 – 26.43, p = 0.2) there were no findings as well. However, when also adjusting for age, the model was overall significant (Likelihood ratio test: p = 1.4E-5) but not the mutation variable itself (HR = 3.24, 95% CI: 0.45 – 23.5, p = 0.25). The same was seen in the CoxPH model which was adjusting for age and high stage; the mutation variable was not significant (HR = 3.28, 95% CI: 0.45 – 23.84, p = 0.24), however, the model itself was (Likelihood ratio test: p = 1E-9).

3.4 Corelation between RPPA proteins and ANO6 expression

ProteinlogFCAvg. Exp.tp-valueAdj. p-value
CYCLINB1-0.419-0.694-9.301.54E-191.67E-17
ASNS-0.314-0.351-9.242.55E-191.67E-17
PCNA-0.114-0.220-7.413.52E-131.54E-11
MAPKpT202Y2040.3000.6287.307.57E-132.48E-11
CYCLIN E1-0.2050.596-7.221.31E-123.42E-11
S6-0.222-0.564-6.891.17E-112.56E-10
P53-0.162-0.142-6.171.10E-112.06E-08
FIBRONECTIN0.1970.9666.141.35E-092.21E-08
YAPPS1270.149-0.1416.032.56E-093.73E-08
CASPASE7CLEAVEDD 198-0.215-0.280-5.751.30E-081.71E-07

Table 2 The 10 most significant proteins associated with ANO6 expression levels in BC patients <\sub>

The logFC represents the log2-fold change, Avg. Exp. the average expression of the protein, t the moderated t-statistic and Adj. p-value the adjust p-value using the Benjamini & Hochberg method. From the fitted linear model on the RPPA dataset, 70 out of the 131 were significantly correlated (adjusted p-value < 0.05) with ANO6 expression. In Table 2, the 10 most significant proteins are seen. The relationship between ANO6 expression and cyclin B1 protein expression may be seen in Figure 3; there is a weak negative correlation between the two variables.

Figure 3 Scatter plot of ANO6 expression levels and Cyclin B1 protein expression

The Spearman correlation test was performed to assess the relationship between these two variables. Spearman’s ρ (coefficient) was -0.34 and its p-value < 2.2E-16.

4. Conclusion

In this study a potential novel prognostic marker was identified, ANO6 (anoctamin 6) for BC patients using TCGA data. There is evidence which suggests that ANO6 plays a role in regulated cell death, and thus activation may be used as a means of inducing death in cancerous cells [7]. A higher expression of ANO6 is associated with poorer prognosis in BC patients. There were no associations found between the mutation states of the gene with its expression levels. The associated methylation probes did not provide sufficient evidence, as there was both a positive and negative correlation between the beta values and the expression level, depending on the probe. The RPPA dataset analysis revealed a protein, Cyclin B1, which was negatively correlated with ANO6 expression levels.

Cyclin B1 plays a role in the G2 and M phases of the cell cycle [8]. Cyclin B1 has already been identified to play a role in BC. A study found that a higher cyclin B1 expression as a predictor of poorer overall and metastatic-free survival [9]. It has also been demonstrated that targeting its expression in combination with certain chemotherapy drugs could be a new gateway to treat BC, as cyclin B1 is needed for survival and proliferation of BC cells [10]. A meta-analysis concluded that cyclin B1 expression has an adverse effect on survival of lung cancer and oesophageal cancer patient [8].

5. References

  1. Siegel, R. L., Miller, K. D. and Jemal, A. (2020) ‘Cancer statistics, 2020’, CA: A Cancer Journal for Clinicians, 70(1), pp. 7–30. doi: 10.3322/caac.21590.

  2. Rojas, K. and Stuckey, A. (2016) ‘Breast Cancer Epidemiology and Risk Factors’, Clinical Obstetrics and Gynecology, 59(4), pp. 651–672. doi: 10.1097/GRF.0000000000000239.

  3. Kunc, M., Biernat, W. and Senkus-Konefka, E. (2018) ‘Estrogen receptor-negative progesterone receptor-positive breast cancer – “Nobody’s land“ or just an artifact?’, Cancer Treatment Reviews, 67(May), pp. 78–87. doi: 10.1016/j.ctrv.2018.05.005.

  4. Parise, C. A. and Caggiano, V. (2014) ‘Breast cancer survival defined by the er/pr/her2 subtypes and a surrogate classification according to tumor grade and immunohistochemical biomarkers’, Journal of Cancer Epidemiology, 2014. doi: 10.1155/2014/469251.

  5. Hochberg, Y. (2016) ‘Controlling the False Discovery Rate : A Practical and Powerful Approach to Multiple Testing Author ( s ): Yoav Benjamini and Yosef Hochberg Source : Journal of the Royal Statistical Society . Series B ( Methodological ), Vol . 57 , No . 1 ( 1995 ), Publi’, 57(1), pp. 289–300.

  6. R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

  7. Kunzelmann, K. et al. (2019) ‘Contribution of Anoctamins to Cell Survival and Cell Death’, Cancers, 11(3), p. 382. doi: 10.3390/cancers11030382.

  8. Ye, C. et al. (2017) ‘Prognostic role of cyclin B1 in solid tumors: A meta-analysis’, Oncotarget, 8(2), pp. 2224–2232. doi: 10.18632/oncotarget.13653.

  9. Aaltonen, K. et al. (2009) ‘High cyclin B1 expression is associated with poor survival in breast cancer’, British Journal of Cancer, 100(7), pp. 1055–1060. doi: 10.1038/sj.bjc.6604874.

  10. Androic, I. et al. (2008) ‘Targeting cyclin B1 inhibits proliferation and sensitizes breast cancer cells to taxol’, BMC Cancer, 8(1), p. 391. doi: 10.1186/1471-2407-8-391.

Appendix - R Code

##### Datasets #####

## Import the downloaded datasets for TCGA BRCA from Xena Browser
## These include clinical, survival, RPPA, mutations, RNAseq, DNA meth 450k, copy number & cnv

clinical    <- read.delim("BRCA_clinicalMatrix", header = T, row.names = 1)
survival    <- read.delim("BRCA_survival.txt", header = T, row.names = 1)
meth         <- read.delim("HumanMethylation450", header = T, row.names = 1)
mutations   <- read.delim("BRCA_mc3_gene_level.txt", header = T)
rppa         <- read.delim("RPPA", header = T, row.names = 1)
rnaseq      <- read.delim("HiSeqV2 2", header = T, row.names = 1)
CNV         <- read.delim("Gistic2_CopyNumber_Gistic2_all_data_by_genes", header = T, row.names=1)
copy.number <- read.delim( "Gistic2_CopyNumber_Gistic2_all_thresholded.by_genes", header = T, row.names = 1)

## Quality control & DF checking 

# for homogeneity and ease in data handling change the sampleID to have "." instead of "-", 
# then order alphabetically 

rownames(clinical)  <- gsub(rownames(clinical), pattern = "-", replace = ".")
rownames(survival) <- gsub(rownames(survival), pattern = "-", replace = ".")

clinical <- clinical[order(rownames(clinical)), ]
survival <- survival[order(rownames(survival)), ]

# order the column names and row names of the copy number, CNV, DNA meth 450k, RPPA, RNAseq,
# and mutations alphabetically

CNV         <- CNV[order(rownames(CNV)), order(colnames(CNV))]
copy.number <- copy.number[order(rownames(copy.number)), order(colnames(CNV))]
meth        <- meth[order(rownames(meth)), order(colnames(meth))]
mutations   <- mutations[order(rownames(mutations)), order(colnames(mutations))]
rppa        <- rppa[order(rownames(rppa)), order(colnames(rppa))]
rnaseq      <- rnaseq[order(rownames(rnaseq)), order(colnames(rnaseq))]


# remove the sample which is NA and make the sample column the rownames, then order alphabetically 

mutations <- mutations[-which(is.na(mutations[, 'sample'])),] %>% remove_rownames() %>% column_to_rownames('sample')  

mutations <- mutations[order(rownames(mutations)),]

# Check for NA entries

length(which(is.na(CNV)))   
length(which(is.na(copy.number)))
length(which(is.na(meth)))
length(which(is.na(mutations)))
length(which(is.na(rnaseq)))
length(which(is.na(rppa)))

# All have 0 NAs, besides the methylation df
# DNA meth 450k df has multiple rows which are all NA, remove those 
# and the probes that have > 50% NA entries

na.count <- apply(meth, 1, function(x) sum(as.numeric(is.na(x))))
exclude  <- as.numeric(na.count > .5 * ncol(meth))
meth2    <- meth[which(exclude == 0), ]


##### Predicting survival using gene expression  #####

# transpose the RNAseq df so the donors are the rows, and the genes the columns

rnaseq  <- as.data.frame(t(rnaseq))

# filter for the donors that are in the survival data 
# using table(rownames(rnaseq) %in% rownames(survival)) we observe that 1215 out of the 1218 IDs
# of the gene expression df are also in the survival df, so filter both df for those 1215 donors

rnaseq2  <- rnaseq %>% rownames_to_column('id') %>% filter(id %in% rownames(survival)) %>% column_to_rownames('id')
surv2    <- survival %>% rownames_to_column('id') %>% filter(id %in% rownames(rnaseq2)) %>% column_to_rownames('id')

# check that the rownames of the 2 new dfs are equal : table(rownames(rnaseq2) == rownames(surv2))


# create the df to store the results of the coxph model adjusted by gene expression level 

cox.rna1 <- as.data.frame(matrix(nrow = ncol(rnaseq2), ncol = 4))
rownames(cox.rna1) <- colnames(rnaseq2)
colnames(cox.rna1) <- c('HR', 'LCI', 'UCI', 'p.val')

for(i in 1:nrow(cox.rna1)) {
  
  model1 <- coxph(Surv(OS.time, OS) ~ rnaseq2[, i], surv2)
  
  cox.rna1$HR[i]    <- summary(model1)$coef[1,2]
  cox.rna1$LCI[i]   <- summary(model1)$conf.int[1,3]
  cox.rna1$UCI[i]   <- summary(model1)$conf.int[1,4]
  cox.rna1$p.val[i] <- summary(model1)$coef[1,5]
  
}

cox.rna1$adj.p.val <- p.adjust(cox.rna1$p.val, method = 'fdr')
cox.rna1           <- cox.rna1[order(cox.rna1$adj.p.val, decreasing = F), ]


# create the df to store the results of the coxph models adjusted by gene expression level and age

cox.rna2 <- as.data.frame(matrix(nrow = ncol(rnaseq2), ncol = 4))
rownames(cox.rna2) <- colnames(rnaseq2)
colnames(cox.rna2) <- c('HR', 'LCI', 'UCI', 'p.val')

for(i in 1:nrow(cox.rna2)) {
  
  model2 <- coxph(Surv(OS.time, OS) ~ rnaseq2[, i] + age, surv2)
  
  cox.rna2$HR[i]    <- summary(model2)$coef[1,2]
  cox.rna2$LCI[i]   <- summary(model2)$conf.int[1,3]
  cox.rna2$UCI[i]   <- summary(model2)$conf.int[1,4]
  cox.rna2$p.val[i] <- summary(model2)$coef[1,5]
  
}

cox.rna2$adj.p.val <- p.adjust(cox.rna2$p.val, method = 'fdr')
cox.rna2           <- cox.rna2[order(cox.rna2$adj.p.val, decreasing = F), ]


# create the df to store the results of the coxph models adjusted by gene expression level, age and high.stage

cox.rna3 <- as.data.frame(matrix(nrow = ncol(rnaseq2), ncol = 4))
rownames(cox.rna3) <- colnames(rnaseq2)
colnames(cox.rna3) <- c('HR', 'LCI', 'UCI', 'p.val')

for(i in 1:nrow(cox.rna3)) {
  
  model3 <- coxph(Surv(OS.time, OS) ~ rnaseq2[, i] + age + high.stage, surv2)
  
  cox.rna3$HR[i]    <- summary(model3)$coef[1,2]
  cox.rna3$LCI[i]   <- summary(model3)$conf.int[1,3]
  cox.rna3$UCI[i]   <- summary(model3)$conf.int[1,4]
  cox.rna3$p.val[i] <- summary(model3)$coef[1,5]
  
}

cox.rna3$adj.p.val <- p.adjust(cox.rna3$p.val, method = 'fdr')
cox.rna3           <- cox.rna3[order(cox.rna3$adj.p.val, decreasing = F), ]


# add the gene expression vector of ANO6 to the survival df

surv2$ano6 <- rnaseq2[ , 'ANO6']

# Create a high/low gene expression variable for ANO6 by splitting on the median

surv2$ano6.high <- ifelse(surv2$ano6 > median(surv2$ano6), 1, 0)


# Run CoxPH models by adjusting for the ANO6 variables (continuous and binary [high vs low])

# univariate adjusted by gene expression level

summary(coxph(Surv(OS.time, OS) ~ ano6, surv2))
summary(coxph(Surv(OS.time, OS) ~ ano6.high, surv2))

# multivariate adjusted by age and gene expression level

summary(coxph(Surv(OS.time, OS) ~ ano6 + age, surv2))
summary(coxph(Surv(OS.time, OS) ~ ano6.high + age, surv2))

# multivariate adjusted by age, high stage, and gene expression level

summary(coxph(Surv(OS.time, OS) ~ ano6 + age + high.stage, surv2))
summary(coxph(Surv(OS.time, OS) ~ ano6.high + age + high.stage, surv2))


# KM plot for OS of BRCA patients stratified by gene expression level of ANO6

ano6.OS <- survfit(formula = Surv(OS.time/365.25, OS) ~ ano6.high, data = surv2)

plot(ano6.OS, 
     col = c('black','red'), 
     xlim = c(0,24), 
     lwd = 2.5, 
     mark.time = T,
     xlab = "Time (years)",
     ylab = "Overall Survival Probability",
     cex.axis = 1.25,
     cex.lab = 1.50)
legend(x = "topright", col = c('black','red'), lwd = 2.5, 
       legend = c("Low ANO6 Expression (n = 608)", "High ANO6 Expression(n = 607)"))
text(x = 19.5, y = 0.88, "HR = 2.18 (95% CI: 1.61 - 2.94)", cex = 1.1)
text(x = 19.5, y = 0.83, "Log-rank p < 0.0001 ", cex = 1.1)

##### Analysis the relationship between gene expression and methylation level  #####

# Find the associated methylation probes (CpG sites) for ANO6 and filter the DNAm df

ano6.probes <- rownames(meth.annot %>% filter(UCSC_RefGene_Name == 'ANO6')) 

# There are 40 associated probes. 
# Filter the meth df

meth.ano6 <- meth[ano6.probes, ]

# using summary(t(meth.ano6)) [so columns are the probes] it is observed that 12 probes have only NA 
# entries, and thus, they cannot be analysed. So, those need to be removed, as well as any probes that have # more than 50% NA entries for the donors. 
 
na.probes  <- as.numeric(apply(meth.ano6, 1, function(x) sum(as.numeric(is.na(x)))) > .5 * ncol(meth.ano6))
meth.ano6  <- meth.ano6[which(exclude == 0), ]

# As the data has now been excluded of most NAs, we need to filter for the donors overlapping in both the
# gene expression data set and the mehtylation data set. 

# table(colnames(meth.ano6) %in% rownames(surv2)) shows that there are 870 in common

surv3     <- surv2 %>% filter(rownames(surv2) %in% colnames(meth.ano6))
ano6.meth <- meth.ano6[, rownames(surv3)]
ano6.meth <- as.data.frame(t(ano6.meth))

# Before deciding which corelation test to preform (Spearman vs Pearson) check the distribution of the beta
# values of each probe, if normally distributed choose Pearson, else Spearman 

# Check by performing the following code for each probe 
# Assess if the beta values are normally distributed based on the density curve
# ggplot(ano6.meth, aes(x = j)) + geom_density() : where ‘j’ is the probe name (colnames(ano6.meth)) 

# Also create the density curve for the distribution of the gene expression of ANO6
# ggplot(rnaseq2, aes(x=ANO6)) + geom_density() 

# Gene expression of ANO6 is normally distributed, however, the beta values are not normally distributed so # proceed with the Spearman corelation test

# Create a df to store in the results from the correlation test between methylation levels and gene expression

cor.meth.rna.ano6 <- as.data.frame(matrix(nrow = 28, ncol = 2))
rownames(cor.meth.rna.ano6) <- colnames(ano6.meth)
colnames(cor.meth.rna.ano6) <- c('coef', 'p.val')

# Create a 'for loop' to run all the cor.tests and extract the information needed

for (i in 1:nrow(cor.meth.rna.ano6)) {
  
  cor.meth.rna.test  <- cor.test(surv3[, 'ano6'], ano6.meth[, i], method = "spearman")
  
  cor.meth.rna.ano6$coef[i]   <- cor.meth.rna.test$estimate
  cor.meth.rna.ano6$p.val[i]  <- cor.meth.rna.test$p.value

}

# adjust for multiple testing 
cor.meth.rna.ano6$adj.p.val <- p.adjust(cor.meth.rna.ano6$p.val, method = 'fdr')

# arrange by p-value
cor.meth.rna.ano6 <- cor.meth.rna.ano6[order(cor.meth.rna.ano6$p.val, decreasing = F), ]

# calculate the average methylation value based on whether the donor has a high or low expression score

cor.meth.rna.ano6$avg.meth.low.ano6  <- apply(t(ano6.meth[which(surv3$ano6.high == 0), ]), 1, mean, na.rm = T)
cor.meth.rna.ano6$avg.meth.high.ano6 <- apply(t(ano6.meth[which(surv3$ano6.high == 1), ]), 1, mean, na.rm = T)

# Extract the 20 probes which are significant
sig.cor.meth.rna.ano6 <- rownames(cor.meth.rna.ano6)[which(cor.meth.rna.ano6$p.val < 0.05)]

# create a df for only the probes which had a significant correlation with the gene expression
ano6.meth.sig <- ano6.meth[, sig.cor.meth.rna.ano6]

# Plot the boxplots of each probe splitting based on their correlation coef (ie pos or neg)

# positively correlated probes

par(mar = c(4,4,3,1) + .1)
boxplot(ano6.meth.sig[, sig.cor.meth.rna.ano6[ which(cor.meth.rna.ano6[ sig.cor.meth.rna.ano6, 'coef' ] > 0 )]], 
        horizontal = F, 
        col = 'lightblue',
        ylab = 'Methylation Beta Values',
        xlab = 'ANO6 Methylation Probes')


# negatively correlated probes
par(mar = c(6,4,2,1) + .1)
boxplot(ano6.meth.sig[, sig.cor.meth.rna.ano6[ - which(cor.meth.rna.ano6[ sig.cor.meth.rna.ano6,  'coef' ] > 0 )]], 
        horizontal = F, 
        col = 'lightblue',
        las = 2,
        names = sig.cor.meth.rna.ano6[-which(cor.meth.rna.ano6[sig.cor.meth.rna.ano6, 'coef'] > 0)],
        ylab = 'Methylation Beta Values',
        xlab = 'ANO6 Methylation Probes')


##### Mutations #####

# Select the row corresponding to the binary mutation data for ANO6

ano6.mut <- mutations['ANO6', ]

# table(t(ano6.mut)) shows that there are 7 donors with the mutations present and 784 without

# perform a Fisher’s exact test on the 2x2 contingency table indicating high vs low gene expression & 
# whether ANO6 has a mutation present or not

fisher.test(surv.mut$ano6.high, surv.mut$ano6.mut)


# filter for the aligning donors of the surv/rna seq df and the mutation one

surv.mut <- surv2 %>% filter(rownames(surv2) %in% colnames(ano6.mut)) %>% 
              select(OS.time, OS, ano6, ano6.high, age, high.stage)

surv.mut$ano6.mut <- t(mutations['ANO6', rownames(surv.mut)])

# perform survival analysis (coxph) by stratifying by mutation state

# univariate
summary(coxph(Surv(OS.time, OS) ~ ano6.mut, data = surv.mut))


# multivariate, adjusted by age
summary(coxph(Surv(OS.time, OS) ~ ano6.mut + age, data = surv.mut))

# multivariate, adjusted by age & high.stage
summary(coxph(Surv(OS.time, OS) ~ ano6.mut + age + high.stage, data = surv.mut))

##### RPPA #####

# filter the donor ids for the overlaping donors with the rnaseq/surv dfs
# table( colnames(rppa) %in% rownames(surv2)) shows that all but 1 from the 747 donors align 

rppa2 <- rppa[, rownames(surv2)[which(rownames(surv2) %in% colnames(rppa))]]

my.design <- cbind(intercept = 1, ANO6 = rnaseq2[colnames(rppa2), 'ANO6']) 
rppa.fit <- lmFit(rppa2, design = my.design)
rppa.fit <- eBayes(rppa.fit)
ANO6.rppa.fit <- topTable(rppa.fit, coef = 2, number = nrow(rppa2))

# Check how many are significant
length(which(ANO6.rppa.fit$adj.P.Val < 0.05)) 

# there are 70 proteins which are significantly associated with ANO6 expression

# select the most significant one and assess the direct relationship with ANO6 gene expression
# rownames(ANO6.rppa.fit)[1] --> the most significant one is CYCLINB1

cyclinb1.rppa <- as.data.frame(t(rppa2['CYCLINB1',]))
cyclinb1.rppa.high <- ifelse(cyclinb1.rppa > median(cyclinb1.rppa$CYCLINB1), 1, 0)

surv.rppa <- surv2[colnames(rppa2), ]
surv.rppa$cyclinb1.rppa.high <- cyclinb1.rppa.high
surv.rppa$cyclinb1.rppa <- cyclinb1.rppa$CYCLINB1

# create a scatter plot to visualise the  between ANO6 expression & Cyclin B1 protein expression

plot(x = surv.rppa$cyclinb1.rppa, y = surv.rppa$ano6,
     xlab = 'Cyclin B1 Protein Expression',
     ylab = 'ANO6 RNA-Seq')
abline(lm(ano6 ~ cyclinb1.rppa, data = surv.rppa), col = 'red', lwd =2)

# run a Spearman’s correlation test to calculate the coef. and p-val 
# note that cyclin b1 is not normally distributed [checked via density plot – see code above]

cor.test(surv.rppa$ano6, surv.rppa$cyclinb1.rppa,  method = 'spearman')

# Spearman’s rho = -0.34 & p < 2.2e-16