A fun data visualization example: analyzing and visualizing Taylor Swift's song lyrics

Original article

Data Visualization and Analysis of Taylor Swift’s Song Lyrics

English study time

Taylor Swift

She is the youngest person to single-handedly write and perform a number-one song on the Hot Country Songs chart published by Billboard magazine in the United States.
Apart from that she is also the recipient of 10 Grammys, one Emmy Award, 23 Billboard Music Awards, and 10 Country Music Association Awards.
Vocabulary: song lyrics (歌词)

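Dataset

Lyrics of 96 songs from 6 Taylor Swift albums, one lyric line per row, with the seven columns listed below

artist: name of the artist
album name: name of the album
track title: name of the song
track number: position of the song within the album
lyric: the lyric text (one line per row)
line number: position of the line within the song
year: year of release of the album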

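Main analyses

Exploratory data analysis

Word count of the lyrics for each song and each album
Change in word count over the years
Frequency distribution of word counts

Text mining

Word cloud
Bigram network (I do not yet understand what this means)
Sentiment analysis

The tool used is R.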
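
On the bigram network item in the list above: a bigram is a pair of adjacent words, and a bigram network is usually drawn as a graph whose nodes are words and whose edges are bigrams weighted by how often they occur. Below is a minimal base-R sketch of the counting step only. The lines are made up for illustration (they are not from the dataset), and bigrams_of is a hypothetical helper, not a function from any package.

```r
# Toy lyric-like lines, invented for illustration (not from the dataset)
toy_lines <- c("the old road goes on",
               "the old road bends")

# bigrams_of is a hypothetical helper: it returns every pair of adjacent words in a line
bigrams_of <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Collect the bigrams of all lines and count how often each one occurs
all_bigrams <- unlist(lapply(toy_lines, bigrams_of))
bigram_counts <- sort(table(all_bigrams), decreasing = TRUE)

# Edge list of the network: one row per distinct bigram, weighted by its count
parts <- strsplit(names(bigram_counts), " ")
edges <- data.frame(from = sapply(parts, `[`, 1),
                    to   = sapply(parts, `[`, 2),
                    n    = as.vector(bigram_counts),
                    stringsAsFactors = FALSE)
print(edges)
```

An edge list like this is what a graph-plotting package would take as input; which package the original article used for drawing is not stated here.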

Exploratory data analysis

A function that was new to me: str_count() from the stringr package. The example from its help page:

library(stringr)
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
# Output:
# [1] 1 3 1 1

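It counts the number of matches of a pattern in each string. For example:

str_count("A B C", "\\S+")

This returns 3, the number of runs of non-whitespace characters (that is, words) in "A B C". In a regular expression, \S matches one non-whitespace character and + means "one or more"; the backslash has to be doubled inside an R string. (I have not mastered regular expressions yet.)

Read in the data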

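# Read the lyrics file (one lyric line per row); keep text columns as character
lyrics <- read.csv("taylor_swift_lyrics_1.csv", header = TRUE, stringsAsFactors = FALSE)
head(lyrics)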

Compute the length (word count) of each lyric line

library(stringr)
# "\\S+" matches each run of non-whitespace characters, so this counts words per line
lyrics$length <- str_count(lyrics$lyric, "\\S+")
head(lyrics)
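
As a rough illustration of what this kind of pattern counts, the sketch below mimics the counting with base R only, on made-up strings. count_matches is a hypothetical helper written for this example, not part of stringr. It contrasts the pattern \S+ (one match per word) with \S (one match per non-whitespace character).

```r
# Count regex matches per string using base R only.
# count_matches is a hypothetical helper, not a stringr function.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

toy <- c("A B C", "two  spaces here", "")

# "\\S+" matches each run of non-whitespace characters, i.e. each word
word_counts <- count_matches(toy, "\\S+")

# "\\S" matches every single non-whitespace character
char_counts <- count_matches(toy, "\\S")

print(word_counts)
print(char_counts)
```

For single-letter words such as "A B C" the two patterns give the same count, which is why the difference is easy to miss.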

Compute the lyric length (word count) of each song

library(dplyr)
length_df<-lyrics%>%
  group_by(track_title)%>%
  summarise(length=sum(length))
head(length_df)
dim(length_df)
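
To make the grouping step concrete, here is a small base-R sketch on invented numbers. It uses aggregate() and order() as a stand-in for the dplyr pipeline, so it is an analogue rather than the same code; the data frame toy and its values are assumptions for illustration only.

```r
# Invented per-line word counts for three hypothetical songs
toy <- data.frame(
  track_title = c("Song A", "Song A", "Song B", "Song C", "Song C", "Song C"),
  length      = c(5, 7, 3, 4, 4, 6),
  stringsAsFactors = FALSE
)

# Sum the line lengths within each song (base-R analogue of group_by + summarise)
length_toy <- aggregate(length ~ track_title, data = toy, FUN = sum)

# Sort songs by total word count, largest first, and keep the top 2
top2 <- head(length_toy[order(-length_toy$length), ], 2)

print(length_toy)
print(top2)
```

Each song ends up as one row whose length is the sum of its line lengths, which is the shape the ranking plots that follow rely on.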

Part 1: the 10 songs with the most words

# Sort songs by word count, largest first, and keep the first 10
Top10wordCount <- arrange(length_df, desc(length)) %>%
  slice(1:10)
library(ggplot2)
# Horizontal bar chart, longest song at the top
ggplot(Top10wordCount, aes(x = reorder(track_title, length), y = length)) +
  geom_col(aes(fill = track_title)) + coord_flip() +
  ylab("Word count") + xlab("") +
  ggtitle("Top 10 songs in terms of word count") +
  theme_minimal() +
  theme(legend.position = "none")

The plot shows that the song with the most words is End Game, followed by Out of the Woods.

Part 2: the 10 songs with the fewest words

# Sort songs by word count, smallest first, and keep the first 10
Bottom10wordCount <- arrange(length_df, length) %>%
  slice(1:10)
library(RColorBrewer)
color <- rainbow(10)
# Horizontal bar chart, shortest song at the top
ggplot(Bottom10wordCount, aes(x = reorder(track_title, -length), y = length)) +
  geom_col(aes(fill = track_title)) + coord_flip() +
  ylab("Word count") + xlab("") +
  ggtitle("Bottom 10 songs in terms of word count") +
  theme_minimal() + scale_fill_manual(values = color) +
  theme(legend.position = "none")

The song with the fewest words is Sad Beautiful Tragic, released in 2012 on the album Red.

Part 3: frequency distribution of word counts

# Histogram of per-song word counts, with a dashed line at the mean
# and a scaled density curve drawn on top
ggplot(length_df, aes(x = length)) +
  geom_histogram(bins = 30, aes(fill = ..count..)) +
  geom_vline(aes(xintercept = mean(length)),
             color = "#FFFFFF", linetype = "dashed", size = 1) +
  geom_density(aes(y = 25 * ..count..), alpha = .2, fill = "#1CCCC6") +
  ylab("Count") + xlab("Length") +
  ggtitle("Distribution of word count") +
  theme_minimal()

Part 4: word count per album

# Total word count per album (year is kept for the trend plot later)
length_df_album <- lyrics %>%
  group_by(album, year) %>%
  summarise(length = sum(length)) %>%
  na.omit()
length_df_album
# Bar chart, albums ordered from most to fewest words
ggplot(length_df_album, aes(x = reorder(album, -length), y = length)) +
  geom_bar(stat = "identity", fill = "#1CCCC6") +
  ylab("Word count") + xlab("Album") +
  ggtitle("Word count based on albums") +
  theme_minimal()

Part 5: trend in per-album word count over time

# Line chart of album word count by release year, with points sized by word count
length_df_album %>%
  arrange(desc(year)) %>%
  ggplot(., aes(x = factor(year), y = length, group = 1)) +
  geom_line(colour = "#1CCCC6", size = 1) +
  ylab("Word count") + xlab("Year") +
  ggtitle("Word count change over the years") +
  theme_minimal() +
  geom_point(aes(x = factor(year), y = length,
                 size = length, color = factor(year)),
             alpha = 0.5) +
  scale_size_continuous(range = c(5, 15)) +
  theme(legend.position = "none")

Part 6: word cloud

library("tm")
library("wordcloud")
lyrics_text <- lyrics$lyric
# Strip punctuation
lyrics_text <- gsub("[[:punct:]]+", "", lyrics_text)
# Strip runs of a repeated letter (\\1 refers back to the captured letter)
lyrics_text <- gsub("([[:alpha:]])\\1+", "", lyrics_text)
# Build a corpus, lowercase it, and drop English stop words
docs <- Corpus(VectorSource(lyrics_text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
# Term frequencies across all lyric lines, most frequent first
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
lyrics_wc_df <- data.frame(word = names(word_freqs), freq = word_freqs)
lyrics_wc_df <- lyrics_wc_df[1:300, ]
set.seed(1234)
wordcloud(words = lyrics_wc_df$word, freq = lyrics_wc_df$freq,
          min.freq = 1, scale = c(1.8, .5),
          max.words = 200, random.order = FALSE, rot.per = 0.15,
          colors = brewer.pal(8, "Dark2"))
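
The word cloud is driven by a word-frequency table. As a minimal sketch of that idea without the tm package, the base-R code below strips punctuation, lowercases, drops stop words, and counts what is left. The two lines of text and the two-word stop list are invented for illustration; the stop list is far smaller than stopwords("english").

```r
# Invented example lines (not from the dataset)
toy_lines <- c("The rain, the rain!", "Rain on the old road.")

# Tiny illustrative stop-word list, an assumption for this sketch only
stop_words <- c("the", "on")

# Strip punctuation, lowercase, and split on whitespace
clean <- tolower(gsub("[[:punct:]]+", "", toy_lines))
words <- unlist(strsplit(clean, "\\s+"))

# Drop empty strings and stop words, then count the remaining words
words <- words[words != "" & !(words %in% stop_words)]
word_freqs_toy <- sort(table(words), decreasing = TRUE)

print(word_freqs_toy)
```

The most frequent words in such a table are the ones a word cloud draws largest.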

Sentiment analysis

I will come back and fill in the rest when I have time.
