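Because the TCGA program ran for so many years and enrolled so many patients, some clinical records inevitably ended up wrong or incomplete. After the program wrapped up (April 2018), the TCGA team therefore made a point of releasing a complete, systematically curated, authoritative clinical dataset. It is described in the paper: An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics, Cell, 5 April 2018, doi: 10.1016/j.cell.2018.02.052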
To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints.
Download link: TCGA-CDR
It looks like gibberish, but it really is the actual download address: https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81
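If you would rather fetch the file from R than from the browser, something like the following should work. It is only a sketch: `fetch_cdr` is a hypothetical helper name, and the local file name is assumed to match the one read by the code later in this post.

```r
# Direct GDC link quoted in the text above.
cdr_url <- "https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81"

# Hypothetical helper: download the workbook once, then reuse the local copy.
fetch_cdr <- function(dest = "TCGA-CDR-SupplementalTableS1.xlsx", url = cdr_url) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary .xlsx intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (needs network access on the first call):
# cdr_path <- fetch_cdr()
```

The `file.exists()` guard means rerunning the script does not download the workbook again.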
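An aside: conflicts between TCGA clinical data from different sources
We have discussed conflicts in survival analysis several times:
- Brainstorming: can you group by expression level however you like in a survival analysis?
- Finding the best gene-expression cutoff for grouping in survival analysis
For example, the code below reads two data sources so they can be compared: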
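rm(list = ls())
options(stringsAsFactors = FALSE)
library(readxl)  # provides read_excel()
# Survival information differs quite a bit between sources.
## From the XENA data hub (URL pattern, shown here for TCGA-LAML):
# https://gdc.xenahubs.net/download/TCGA-LAML/Xena_Matrices/TCGA-LAML.survival.tsv.gz
# Keep columns 2 to 4 of the HNSC survival table
clin1 = read.table('../data/TCGA-HNSC.survival.tsv.gz', header = TRUE)[, 2:4]
# The first 12 characters of the barcode identify the patient
clin1$pid = substring(clin1[, 2], 1, 12)
head(clin1)
# Convert days to months (30 days per month)
clin1[, 3] = clin1[, 3] / 30
# Clamp negative follow-up times to 0; which() skips NA values
clin1[which(clin1[, 3] < 0), 3] = 0
# Read in the TCGA-CDR data
clin3 = read_excel("./TCGA-CDR-SupplementalTableS1.xlsx", sheet = 3, na = "NA")
clin3 = as.data.frame(clin3)
rownames(clin3) = substring(clin3[, 2], 1, 12)
clin3 = clin3[, -c(1:3)]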
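Once both tables are loaded, the comparison itself comes down to a join on the patient barcode. The sketch below uses two invented toy tables in place of `clin1` and `clin3`, so the column names (`OS`, `OS.time`, `PFI`, `PFI.time`) and every value are assumptions for illustration only. Both times are kept in days here.

```r
# Toy stand-ins for the Xena and TCGA-CDR tables; all values are invented.
xena <- data.frame(
  pid     = c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"),
  OS      = c(0, 1, 0),
  OS.time = c(722, 300, 1500)   # days
)
cdr <- data.frame(
  pid      = c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0004"),
  PFI      = c(1, 1, 0),
  PFI.time = c(517, 300, 900)   # days
)

# Inner join on the 12-character patient barcode.
both <- merge(xena, cdr, by = "pid")

# Flag patients whose times or event indicators disagree.
both$time_diff  <- both$OS.time - both$PFI.time
both$event_diff <- both$OS != both$PFI
print(both[both$time_diff != 0 | both$event_diff, ])

# A scatter plot makes the pattern visible at a glance:
# plot(both$OS.time, both$PFI.time); abline(0, 1)
```

With the real data, remember to put both time columns on the same scale first, since the code above converts the Xena times to months.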
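While comparing these two files I noticed inconsistencies, and when I searched for an explanation I found, of all things, a post shared by a former student of mine from the West China School of Medicine: https://www.jianshu.com/p/0a4a492b130e
The difference arises because the endpoint events are not defined the same way. In Xena's survival.tsv the outcome event is death, whereas in TCGA-CDR the endpoint event for PFI.1 is disease progression, which covers death, recurrence, metastasis and so on. Take patient TCGA-BA-5151: he may have had a tumor recurrence detected 517 days after surgery and been lost to follow-up on day 722. Xena's survival data then record him as censored at 722 days, while TCGA-CDR records an event at 517 days, so the two variables disagree. The scatter plot shows the same thing: the CDR PFI1.time is never greater than Xena's time2event. The TCGA-CDR spreadsheet itself explains this.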
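The TCGA-BA-5151 scenario can be written out in a few lines to show how one follow-up record yields two different (event, time) pairs. This is a simplified sketch: the field names in `pt` are hypothetical, the day counts are the illustrative ones from the paragraph above, and the PFI rule is reduced to "any new tumor event counts".

```r
# Hypothetical follow-up record mirroring the example in the text.
pt <- list(
  days_to_new_tumor_event = 517,  # recurrence found
  days_to_last_contact    = 722,  # lost to follow-up
  days_to_death           = NA    # no death recorded
)

# Overall survival: the event is death; otherwise censor at last contact.
os_event <- as.integer(!is.na(pt$days_to_death))
os_time  <- if (os_event == 1) pt$days_to_death else pt$days_to_last_contact

# Progression-free interval: a new tumor event counts as the event;
# without one, fall back to the overall-survival follow-up time.
# (Simplified: the full definition also counts death with the cancer.)
pfi_event <- as.integer(!is.na(pt$days_to_new_tumor_event))
pfi_time  <- if (pfi_event == 1) pt$days_to_new_tumor_event else os_time

data.frame(endpoint = c("OS", "PFI"),
           event    = c(os_event, pfi_event),
           time     = c(os_time, pfi_time))
```

The same patient is censored at day 722 for OS and counted as an event at day 517 for PFI.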
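Which time point should a survival analysis use?
This is not a multiple-choice question. Since the TCGA group curated four major clinical outcome endpoints (OS, DSS, DFI and PFI), any of them can be used; different choices simply lead to different biological interpretations of the results.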
DSS: disease-specific survival event, 1 for patient whose vital_status was Dead and tumor_status was WITH TUMOR. If a patient died from the disease shown in field of cause_of_death, the status of DSS would be 1 for the patient. 0 for patient whose vital_status was Alive or whose vital_status was Dead and tumor_status was TUMOR FREE. This is not a 100% accurate definition but is the best we could do with this dataset. Technically a patient could be with tumor but died of a car accident and therefore incorrectly considered as an event.
DSS.time: disease-specific survival time in days, last_contact_days_to or death_days_to, whichever is larger.
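Read as code, the DSS rule above looks roughly like this. It is a sketch on an invented toy table: the field names are taken from the definitions, but the values are made up, and the cause_of_death clause is left out for brevity.

```r
# Toy clinical table; field names follow the definitions above, values are invented.
clin <- data.frame(
  vital_status         = c("Alive", "Dead", "Dead"),
  tumor_status         = c("TUMOR FREE", "WITH TUMOR", "TUMOR FREE"),
  last_contact_days_to = c(800, NA, 350),
  death_days_to        = c(NA, 420, 400)
)

# DSS event: 1 only when the patient died with tumor, 0 otherwise.
clin$DSS <- as.integer(clin$vital_status == "Dead" &
                       clin$tumor_status == "WITH TUMOR")

# DSS.time: the larger of the two follow-up fields, ignoring missing values.
clin$DSS.time <- pmax(clin$last_contact_days_to, clin$death_days_to, na.rm = TRUE)

print(clin[, c("vital_status", "tumor_status", "DSS", "DSS.time")])
```

The third toy patient died tumor free, so the death is not counted as a disease-specific event even though it fixes the follow-up time.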
DFI: disease-free interval event, 1 for patient having new tumor event whether it is a local recurrence, distant metastasis, new primary tumor of the cancer, including cases with a new tumor event whose type is N/A. Disease free was defined by: first, treatment_outcome_first_course is "Complete Remission/Response"; if the tumor type doesn't have "treatment_outcome_first_course" then disease-free was defined by the value "R0" in the field of "residual_tumor"; otherwise, disease-free was defined by the value "negative" in the field of "margin_status". If the tumor type did not have any of these fields, then its DFI was NA.
DFI.time: disease-free interval time in days, new_tumor_event_dx_days_to for events, or for censored cases, either last_contact_days_to or death_days_to, whichever is applicable.
PFI: progression-free interval event, 1 for patient having new tumor event whether it was a progression of disease, local recurrence, distant metastasis, new primary tumors all sites, or died with the cancer without new tumor event, including cases with a new tumor event whose type is N/A.
PFI.time: progression-free interval time in days, for events, either new_tumor_event_dx_days_to or death_days_to, whichever is applicable; or for censored cases, either last_contact_days_to or death_days_to, whichever is applicable.
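To see how the choice of endpoint changes what a survival model is fitted on, the same grouping can be run against several endpoint columns in a loop. The sketch below assumes the `survival` package is installed and uses an invented six-patient table; the `group` column stands in for any hypothetical high/low split, and OS is included as the fourth endpoint alongside the ones defined above.

```r
library(survival)

# Toy per-patient table with CDR-style columns; all values are invented.
cdr <- data.frame(
  OS  = c(1, 0, 1, 0, 1, 0), OS.time  = c(400, 900, 650, 1200, 300, 1000),
  DSS = c(1, 0, 0, 0, 1, 0), DSS.time = c(400, 900, 650, 1200, 300, 1000),
  PFI = c(1, 1, 1, 0, 1, 0), PFI.time = c(250, 600, 500, 1200, 200, 1000),
  group = c("high", "low", "high", "low", "high", "low")
)

endpoints <- c("OS", "DSS", "PFI")

# Fit one Kaplan-Meier model per endpoint, keeping the grouping fixed.
fits <- lapply(endpoints, function(ep) {
  df <- data.frame(time  = cdr[[paste0(ep, ".time")]],
                   event = cdr[[ep]],
                   group = cdr$group)
  survfit(Surv(time, event) ~ group, data = df)
})
names(fits) <- endpoints

# Total number of events seen by each model.
events_per_endpoint <- sapply(fits, function(f) sum(f$n.event))
print(events_per_endpoint)
```

The patients and the grouping are identical across the three fits; only the event definition changes, and with it the event count and the curve.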