Adding a new column to a Spark DataFrame

2022-05-07 14:26:44

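Adding a column to a DataFrame is a very common task.

Even so, there is not much material on it, and most of what exists involves a lot of transformations. Some fields are also awkward to add.

This time, though, the column I needed was very simple, so there was no need to reach for a UDF to modify columns.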

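The withColumn function is enough to add a column to a DataFrame. Its second argument, col, has to be a Column expression rather than a plain value, so here I simply built the new column from the existing id column. (A constant can also be wrapped with lit from org.apache.spark.sql.functions, in which case no existing column is needed.)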

    scala> val df = sqlContext.range(0, 10)
    df: org.apache.spark.sql.DataFrame = [id: bigint]

    scala> df.show()
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+

    scala> df.withColumn("bb",col(id)*0)
    <console>:28: error: not found: value id
           df.withColumn("bb",col(id)*0)
                                  ^

    scala> df.withColumn("bb",col("id")*0)
    res2: org.apache.spark.sql.DataFrame = [id: bigint, bb: bigint]

    scala> df.show()
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+

    scala> res2.show()
    +---+---+
    | id| bb|
    +---+---+
    |  0|  0|
    |  1|  0|
    |  2|  0|
    |  3|  0|
    |  4|  0|
    |  5|  0|
    |  6|  0|
    |  7|  0|
    |  8|  0|
    |  9|  0|
    +---+---+

    scala> res2.withColumn("cc",col("id")*0)
    res5: org.apache.spark.sql.DataFrame = [id: bigint, bb: bigint, cc: bigint]

    scala> res3.show()
    <console>:30: error: value show is not a member of Unit
           res3.show()
                ^

    scala> res5.show()
    +---+---+---+
    | id| bb| cc|
    +---+---+---+
    |  0|  0|  0|
    |  1|  0|  0|
    |  2|  0|  0|
    |  3|  0|  0|
    |  4|  0|  0|
    |  5|  0|  0|
    |  6|  0|  0|
    |  7|  0|  0|
    |  8|  0|  0|
    |  9|  0|  0|
    +---+---+---+
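Two things in the session are easy to miss: df.show() prints the same single column after the withColumn call, because withColumn returns a new DataFrame and leaves the original untouched, and the constant column does not have to be computed from id. The following is a minimal sketch of the same steps as a standalone program. It assumes Spark 2.x or later, where SparkSession replaces the sqlContext used in the session; the object name AddColumnSketch, the helper addZeroColumns, the app name, and the local master setting are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical object and helper names, used only for this sketch.
object AddColumnSketch {

  // Returns a NEW DataFrame with two extra bigint columns; `df` itself is not modified.
  def addZeroColumns(df: DataFrame): DataFrame =
    df
      .withColumn("bb", lit(0L))        // constant column built from a literal
      .withColumn("cc", col("id") * 0)  // constant column derived from id, as in the session

  def main(args: Array[String]): Unit = {
    // Local single-threaded session; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("add-column-sketch")
      .master("local[1]")
      .getOrCreate()

    val df = spark.range(0, 10).toDF("id")  // one bigint column named id
    val result = addZeroColumns(df)

    df.printSchema()      // still only id
    result.printSchema()  // id, bb, cc
    result.show()

    spark.stop()
  }
}
```

Assigning the result to a val, as above, avoids having to track the REPL's resN names, which is what went wrong with res3 in the session.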
