How much weight does bioinformatics carry in modern biology?

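A publisher asked our 生信技能树 (Bioinformatics Skill Tree) team to help translate and compile some books in a related field (bioinformatics). That reminded me of the "modern biology" book series Methods in Molecular Biology, so I started by systematically skimming its titles. By eye, though, the sheer variety is dazzling and it is hard to get the full picture. That is where two handy tools come in: web scraping and word clouds. Let's give them a try.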

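First, scrape the main titles and subtitles of all the books

The listing pages follow a simple URL pattern, with the page number running from 1 to 272 (as of July 9, 2023):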

https://www.springer.com/series/7651/books?page=2
https://www.springer.com/series/7651/books?page=272

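The number of books keeps growing, so the last page number will change over time.

Using the Inspect tool in Google Chrome, you can see that each main book title in a page's book list is marked up like this: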

<a href="https://www.springer.com/book/9781071634165" data-track="click" data-track-action="clicked article" data-track-label="article-8">RNA Nanostructures
                    </a>
                    
<a href="https://www.springer.com/book/9780896033191" data-track="click" data-track-action="clicked article" data-track-label="article-0">Yeast Protocols
                    </a>

Each book also has a subtitle, marked up like this:

<p class="u-text-md u-mb-8" data-test="book-sub-title">
                        Methods in Cell and Molecular Biology
                    </p>
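
Before hitting the live site, it can help to check the CSS selectors against a saved fragment. The sketch below parses a trimmed copy of the markup shown above, offline, with rvest. The wrapping `<div>` and the pairing of this particular title with this particular subtitle are assumptions made only for illustration.

```r
library(rvest)

# Trimmed copy of the markup shown above; the <div> wrapper and the
# title/subtitle pairing are illustrative assumptions.
snippet <- '
<div>
  <a href="https://www.springer.com/book/9780896033191" data-track="click" data-track-action="clicked article" data-track-label="article-0">Yeast Protocols
  </a>
  <p class="u-text-md u-mb-8" data-test="book-sub-title">
      Methods in Cell and Molecular Biology
  </p>
</div>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(snippet)

# Attribute selectors pick out the elements by their data-* attributes;
# trim = TRUE strips the surrounding whitespace and line breaks
main_title <- html_text(html_nodes(page, "a[data-track-action='clicked article']"), trim = TRUE)
sub_title  <- html_text(html_nodes(page, "p[data-test='book-sub-title']"), trim = TRUE)

print(main_title)
print(sub_title)
```

If a selector comes back empty on a fragment like this, it will also come back empty on the real page, so this is a cheap way to catch typos in the attribute names.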

After that, it is just a matter of parsing these pages with the rvest package. The full code is as follows:

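# Install (if needed) and load the rvest package
if (!require(rvest)) {
  install.packages("rvest")
}
library(rvest)

# Build the URLs to scrape (pages 1 to 272 as of 2023-07-09)
urls <- paste0("https://www.springer.com/series/7651/books?page=", 1:272)

titles_txt <- lapply(urls, function(url) {
  print(url)
  # Read the page; on failure, report the URL and return NULL instead of stopping
  webpage <- tryCatch(
    read_html(url),
    error = function(e) {
      message("error: ", url)
      NULL
    }
  )

  # Pause 1 to 10 seconds between requests to go easy on the server
  Sys.sleep(sample(1:10, 1))

  # Skip pages that could not be read so the loop keeps going
  if (is.null(webpage)) {
    return(list(main_text = character(0), sub_text = character(0)))
  }

  # Locate the elements with CSS selectors (XPath would also work);
  # adjust the selectors if the page's HTML structure changes.
  # Main titles: data-track-action="clicked article"
  main_text <- webpage %>%
    html_nodes("a[data-track-action='clicked article']") %>%
    html_text(trim = TRUE)
  # Subtitles: data-test="book-sub-title"
  sub_text <- webpage %>%
    html_nodes("p[data-test='book-sub-title']") %>%
    html_text(trim = TRUE)

  list(main_text = main_text, sub_text = sub_text)
})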

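The code above collects the main titles and subtitles of all the books. The next step is a simple summary of the title text. A quick look shows that very few titles are related to bioinformatics: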

 [1] "Plant Bioinformatics"                                     
 [2] "RNA Bioinformatics"                                       
 [3] "Translational Bioinformatics for Therapeutic Development" 
 [4] "Bioinformatics for Cancer Immunotherapy"                  
 [5] "Structural Bioinformatics"                                
 [6] "Microarray Bioinformatics"                                
 [7] "Cancer Bioinformatics"                                    
 [8] "Bioinformatics and Drug Discovery"                        
 [9] "Bioinformatics in MicroRNA Research"                      
[10] "Protein Bioinformatics"                                   
[11] "Proteome Bioinformatics"                                  
[12] "Bioinformatics"                                           
[13] "Bioinformatics"                                           
[14] "RNA Bioinformatics"                                       
[15] "Bioinformatics and Drug Discovery"                        
[16] "Bioinformatics for Comparative Proteomics"                
[17] "Bioinformatics for Omics Data"                            
[18] "Clinical Bioinformatics"                                  
[19] "Next Generation Microarray Bioinformatics"                
[20] "Plant Bioinformatics"                                     
[21] "Proteome Bioinformatics"                                  
[22] "Bioinformatics Methods in Clinical Research"              
[23] "Bioinformatics"                                           
[24] "Bioinformatics"                                           
[25] "Bioinformatics and Drug Discovery"                        
[26] "Bioinformatics for DNA Sequence Analysis"                 
[27] "Microbial Gene Essentiality: Protocols and Bioinformatics"
[28] "Plant Bioinformatics"                                     
[29] "Bioinformatics Methods and Protocols"
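
A list like the one above can be produced with a case-insensitive match on the main titles, and the same match gives a rough share of bioinformatics titles. The sketch below is not the code that generated the output above; it uses a small hypothetical stand-in (`titles_demo`) shaped like `titles_txt`, with only the `main_text` element filled in.

```r
# Hypothetical stand-in for `titles_txt`: one list per scraped page
titles_demo <- list(
  list(main_text = c("RNA Nanostructures", "Plant Bioinformatics")),
  list(main_text = c("Yeast Protocols", "Structural Bioinformatics"))
)

# Flatten the per-page main titles into a single character vector
all_titles <- unlist(lapply(titles_demo, `[[`, "main_text"))

# Flag titles that mention "bioinformatics", ignoring case
is_bioinfo <- grepl("bioinformatics", all_titles, ignore.case = TRUE)
bioinfo_titles <- all_titles[is_bioinfo]
print(bioinfo_titles)

# The mean of a logical vector is the proportion of TRUE values
share <- mean(is_bioinfo)
cat(sprintf("%d of %d titles (%.1f%%) mention bioinformatics\n",
            sum(is_bioinfo), length(all_titles), 100 * share))
```

On the real data, replace `titles_demo` with `titles_txt`. Because several books appear in more than one edition, the same title can be counted more than once; wrapping `all_titles` in `unique()` would count distinct titles instead.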

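Next, summarize the titles with a word cloud

A quick Bing search for the keywords "word cloud in r" turns up a solution. The first link is http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know and its code is organized into 5 steps.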

Step 1: Create a text file
Step 2 : Install and load the required packages
Step 3 : Text mining
Step 4 : Build a term-document matrix
Step 5 : Generate the Word cloud

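Readers who know basic R should find this easy to follow. If you do not know R yet, I recommend reading:

《生信分析人员如何系统入门R(2019更新版)》 (How bioinformatics analysts can systematically get started with R, 2019 update)
《生信分析人员如何系统入门Linux(2019更新版)》 (How bioinformatics analysts can systematically get started with Linux, 2019 update)

Then work through the R learning roadmap, as follows:

Understand the concepts of constants and variables
Arithmetic operations such as addition, subtraction, multiplication, and division (R as a calculator)
The main data types (numeric, character, logical, factor)
The main data structures (vector, matrix, array, data frame, list)
Reading and writing files
Simple statistics and visualization
Learning new functions, of which there is no end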

The core of the code is the wordcloud function, but preparing data in the input format it expects takes some working knowledge of R.
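
To make that input format concrete: wordcloud takes a vector of words and a matching vector of frequencies. The sketch below builds such a table in base R from three example titles. It deliberately skips the tm pipeline (no stopword removal or other cleaning), so it is a simplified illustration of the target shape, not a substitute for the full preprocessing.

```r
# Three example titles taken from the scraped list
titles <- c("Plant Bioinformatics", "RNA Bioinformatics", "Yeast Protocols")

# Lower-case the text and split on anything that is not a letter
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0]  # drop empty strings left by the split

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Two columns, word and freq: the shape passed to wordcloud(words = , freq = )
d <- data.frame(word = names(freq), freq = as.vector(freq),
                stringsAsFactors = FALSE)
print(d)
```

A data frame like `d` can then be passed on as `wordcloud(words = d$word, freq = d$freq)`.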

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
wd <- function(text){
  # Load the data as a corpus
  docs <- Corpus(VectorSource(text))
  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
  docs <- tm_map(docs, toSpace, "/")
  docs <- tm_map(docs, toSpace, "@")
  docs <- tm_map(docs, toSpace, "\\|")
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove english common stopwords
  docs <- tm_map(docs, removeWords, stopwords("english"))
  # Remove your own stop word
  # specify your stopwords as a character vector
  docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  # Text stemming
  # docs <- tm_map(docs, stemDocument)
  
  dtm <- TermDocumentMatrix(docs)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  print(head(d, 10))
  set.seed(1234)
  wordcloud(words = d$word, freq = d$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))
}
wd(unlist(lapply(titles_txt, '[[',1)))
wd(unlist(lapply(titles_txt, '[[',2)))

Note that if you do not set a random seed, the layout of the word cloud will differ on every run.
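
The effect of the seed is easy to see in isolation. The sketch below uses `runif()` as a stand-in for the random draws a word cloud layout relies on; that stand-in is an assumption for illustration, not a description of wordcloud's internals.

```r
# Same seed, same random draws
set.seed(1234)
first_run <- runif(3)

set.seed(1234)
second_run <- runif(3)

# No reseeding here: the generator simply continues from its current state
third_run <- runif(3)

print(identical(first_run, second_run))  # reseeded, so the draws repeat
print(identical(first_run, third_run))   # not reseeded, so the draws differ
```

This is why `set.seed(1234)` sits inside `wd()` right before the `wordcloud()` call: each call starts from the same generator state, so the same input should give the same layout.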

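The word clouds give a broad picture of what "modern biology" covers:

Modern biology is a broad concept that spans many different fields of biology, including but not limited to molecular biology, cell biology, biochemistry, genetics, biophysics, bioinformatics, ecology, and evolutionary biology. All of these fields keep developing to keep pace with rapid advances in science and technology. Within modern biology, a few key themes and trends stand out:

Molecular and cell biology: the core of modern biology, covering the basic unit of life, the cell, and the molecular processes inside it.
Genetics and genomics: with advances in sequencing technology, we can now determine an individual's genome quickly and accurately, which provides powerful tools for studying genetic diseases, evolution, and biodiversity.
Bioinformatics and computational biology: with the explosive growth of biological data, storing, analyzing, and interpreting that data effectively has become a major problem. Bioinformatics and computational biology are the disciplines that address it.
Systems biology: a field that tries to understand the behavior of biological systems as a whole, rather than studying only individual genes or proteins.
Ecology and environmental biology: as human impact on the Earth's environment grows, understanding the structure and function of ecosystems, and how we affect them, becomes increasingly important.
Biotechnology and synthetic biology: using biological systems to solve practical problems, such as producing drugs, biofuels, and other useful compounds, and designing and building new biological systems.

These are only some of the fields of modern biology; its actual scope and depth go well beyond them.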
