The official, complete TCGA clinical data resource

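Because the TCGA program ran for so long and enrolled so many patients, some clinical records inevitably contain recording errors or are incomplete. After the program wrapped up (April 2018), the TCGA team therefore put in the effort to publish a complete, systematically curated, authoritative set of clinical data. It accompanies the paper: An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics, Cell, 2018 Apr 5, doi: 10.1016/j.cell.2018.02.052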

To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints.

Download link: TCGA-CDR

It looks like gibberish, but this really is the download address: https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81
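For readers who prefer to fetch the file from within R, here is a minimal sketch. It assumes the GDC API still serves the file at the address above and that the file is the Excel workbook read later in this post; the helper name `fetch_cdr` and the local file name are my own choices, not part of any official tooling.

```r
# Sketch: download the TCGA-CDR workbook from the GDC API.
# Assumptions: the URL still resolves, and the payload is an .xlsx file.
cdr_url  <- "https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81"
cdr_file <- "TCGA-CDR-SupplementalTableS1.xlsx"

# Hypothetical helper: downloads only if the file is not already present.
fetch_cdr <- function(url = cdr_url, destfile = cdr_file) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the binary workbook intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# fetch_cdr()  # uncomment to actually download
```

The download call is left commented out so that sourcing the script does not hit the network by accident.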

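Aside: conflicts between TCGA clinical data from different sources

We have discussed conflicting survival-analysis results several times:

Brainstorming: can you group samples by expression level however you like in a survival analysis?
Finding the best gene-expression cutoff for grouping in survival analysis

For example, the code below reads in two data sources for comparison: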


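rm(list = ls())
options(stringsAsFactors = FALSE)
library(readxl)  # provides read_excel(), used for the TCGA-CDR workbook below
# Survival information from different sources can differ quite a bit.

## Source 1: the Xena data hub
# Example URL (LAML shown; the HNSC file read below is assumed to follow the same naming):
# https://gdc.xenahubs.net/download/TCGA-LAML/Xena_Matrices/TCGA-LAML.survival.tsv.gz
clin1=read.table('../data/TCGA-HNSC.survival.tsv.gz',header = TRUE)[,2:4]
clin1$pid=substring(clin1[,2],1,12)  # first 12 characters = patient barcode
head(clin1)
clin1[,3]=clin1[,3]/30               # convert days to approximate months
clin1[clin1[,3] <0,3]=0              # clamp negative follow-up times to 0

## Source 2: read in the TCGA-CDR data
clin3=read_excel("./TCGA-CDR-SupplementalTableS1.xlsx",sheet=3,na="NA")
clin3 = as.data.frame(clin3)
rownames(clin3) = substring(clin3[,2],1,12)  # index rows by patient barcode
clin3 = clin3[, -c(1:3)]                     # drop the leading ID columns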

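While comparing these two files I noticed inconsistencies, and when I searched for an explanation I ended up finding a post by a former student of mine from West China Medical School: https://www.jianshu.com/p/0a4a492b130e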

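The discrepancy comes from the two sources defining the endpoint event differently. In Xena's survival.tsv the outcome event is death, whereas in TCGA-CDR the endpoint event for PFI.1 is disease progression, which covers death, recurrence, metastasis and so on. Take patient TCGA-BA-5151: he may have had a tumor recurrence detected 517 days after surgery and then been lost to follow-up on day 722. Xena's survival data would record this as censored at 722 days, while TCGA-CDR records an event at 517 days, so the two variables disagree. The scatter plot in that post shows the same pattern: the CDR PFI.1 time is never greater than Xena's time to event (time2event). This is explained in the TCGA-CDR spreadsheet itself.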
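The mismatch can be made concrete with a tiny side-by-side table. This is a toy sketch only: the numbers for TCGA-BA-5151 follow the hypothetical scenario described above, the two `TOY-` patients are invented, and the column names (`OS`, `OS.time`, `PFI`, `PFI.time`) are simplified stand-ins rather than the exact headers of either file.

```r
# Toy data: overall-survival view (Xena-style), event = death
xena <- data.frame(
  pid     = c("TCGA-BA-5151", "TOY-00-0001", "TOY-00-0002"),
  OS      = c(0, 1, 0),          # 0 = censored, 1 = died
  OS.time = c(722, 300, 1000)    # days
)

# Toy data: progression-free view (CDR-style), event = progression or death
cdr <- data.frame(
  pid      = c("TCGA-BA-5151", "TOY-00-0001", "TOY-00-0002"),
  PFI      = c(1, 1, 0),         # 1 = recurrence, metastasis, death, ...
  PFI.time = c(517, 300, 1000)   # days
)

# Join on patient barcode and flag where the two sources disagree
both <- merge(xena, cdr, by = "pid")
both$event_differs <- both$OS != both$PFI
both$time_gap      <- both$OS.time - both$PFI.time
print(both)
```

Only the patient with a recurrence before the end of follow-up differs between the two views; patients without an earlier progression event agree in both status and time.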

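Which time point should a survival analysis use?

This is not a multiple-choice question. Since the TCGA team has curated four major clinical outcome endpoints, any of them can be used; each choice simply gives results with a different biological interpretation.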
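As a rough illustration of running the same analysis against more than one endpoint, the sketch below fits Kaplan-Meier curves for two endpoints with the `survival` package. The data frame is invented, and its column names only mimic the endpoint/time pairing used in TCGA-CDR; with real data you would substitute the columns from the workbook.

```r
library(survival)

# Invented toy data: two expression groups, two endpoints
toy <- data.frame(
  group    = rep(c("high", "low"), each = 4),
  OS       = c(1, 0, 1, 0, 1, 1, 0, 1),
  OS.time  = c(400, 900, 650, 1200, 200, 350, 800, 500),
  PFI      = c(1, 1, 1, 0, 1, 1, 1, 1),
  PFI.time = c(300, 700, 650, 1200, 150, 350, 600, 420)
)

endpoints <- c("OS", "PFI")

# One Kaplan-Meier fit per endpoint, stratified by group
fits <- lapply(endpoints, function(ep) {
  # Pair each event column with its matching <endpoint>.time column
  s <- Surv(toy[[paste0(ep, ".time")]], toy[[ep]])
  survfit(s ~ group, data = toy)
})
names(fits) <- endpoints

# Number of events per group under each endpoint
lapply(fits, function(f) summary(f)$table[, "events"])
```

The same patients yield different event counts depending on the endpoint, which is why the choice has to be stated and interpreted rather than treated as interchangeable.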

DSS: disease-specific survival event, 1 for patient whose vital_status was Dead and tumor_status was WITH TUMOR. If a patient died from the disease shown in field of cause_of_death, the status of DSS would be 1 for the patient. 0 for patient whose vital_status was Alive or whose vital_status was Dead and tumor_status was TUMOR FREE. This is not a 100% accurate definition but is the best we could do with this dataset. Technically a patient could be with tumor but died of a car accident and therefore incorrectly considered as an event.

DSS.time: disease-specific survival time in days, last_contact_days_to or death_days_to, whichever is larger.

DFI: disease-free interval event, 1 for patient having new tumor event whether it is a local recurrence, distant metastasis, new primary tumor of the cancer, including cases with a new tumor event whose type is N/A. Disease free was defined by: first, treatment_outcome_first_course is "Complete Remission/Response"; if the tumor type doesn't have "treatment_outcome_first_course" then disease-free was defined by the value "R0" in the field of "residual_tumor"; otherwise, disease-free was defined by the value "negative" in the field of "margin_status". If the tumor type did not have any of these fields, then its DFI was NA.

DFI.time: disease-free interval time in days, new_tumor_event_dx_days_to for events, or for censored cases, either last_contact_days_to or death_days_to, whichever is applicable.

PFI: progression-free interval event, 1 for patient having new tumor event whether it was a progression of disease, local recurrence, distant metastasis, new primary tumors all sites, or died with the cancer without new tumor event, including cases with a new tumor event whose type is N/A.

PFI.time: progression-free interval time in days, for events, either new_tumor_event_dx_days_to or death_days_to, whichever is applicable; or for censored cases, either last_contact_days_to or death_days_to, whichever is applicable.
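To see how such a definition translates into code, here is a simplified sketch of the DSS and DSS.time rules quoted above. It is not the TCGA team's implementation: it ignores the cause_of_death refinement, the records are invented, and only the field names are taken from the definitions.

```r
# Invented toy records; field names follow those quoted in the definitions
pts <- data.frame(
  vital_status         = c("Dead", "Dead", "Alive", "Dead"),
  tumor_status         = c("WITH TUMOR", "TUMOR FREE", "WITH TUMOR", NA),
  last_contact_days_to = c(NA, 100, 900, NA),
  death_days_to        = c(450, 600, NA, 700),
  stringsAsFactors     = FALSE
)

dead       <- pts$vital_status == "Dead"
with_tumor <- pts$tumor_status == "WITH TUMOR"
tumor_free <- pts$tumor_status == "TUMOR FREE"

# DSS event: 1 = dead with tumor; 0 = alive, or dead but tumor free;
# anything that cannot be decided (e.g. missing tumor_status) stays NA
pts$DSS <- NA_integer_
pts$DSS[which(dead & with_tumor)] <- 1L
pts$DSS[which(pts$vital_status == "Alive" | (dead & tumor_free))] <- 0L

# DSS.time: the larger of the two follow-up fields, ignoring missing values
pts$DSS.time <- pmax(pts$last_contact_days_to, pts$death_days_to, na.rm = TRUE)

print(pts[, c("vital_status", "tumor_status", "DSS", "DSS.time")])
```

Wrapping the conditions in `which()` keeps rows with a missing tumor_status from being assigned an event status by accident.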
