Create Substring Column In Spark Dataframe


Answer :

You can use a statement like the following:



import org.apache.spark.sql.functions._


dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
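Here, `substring_index(col("a"), ",", 1)` returns everything in column `a` before the first comma. A minimal sketch, assuming a spark-shell session (the sample data is illustrative):

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

// substring_index keeps everything before the first occurrence of the delimiter
val parts = Seq("foo,bar,baz").toDF("a")
parts.select(col("a"), substring_index(col("a"), ",", 1).as("b")).show()
// the "b" column contains "foo"
```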



Suppose you have the following dataframe:



import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
| a| b|
+------+---+
|foobar|foo|
+------+---+


You can derive a new column from the first column with the built-in substring function as follows:



df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
| a| b| c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
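Note that `substring(str, pos, len)` in Spark SQL is 1-based, unlike Scala's own `String.substring`, which is 0-based. A quick sketch on the same dataframe:

```scala
import org.apache.spark.sql.functions._

// pos = 1 starts at the first character; len caps the result length
df.select(substring(col("a"), 1, 3).as("prefix")).show()
// for a = "foobar", prefix = "foo"
```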


Alternatively, you can use the withColumn function with a UDF:



import org.apache.spark.sql.functions.{udf, col}
def substringFn(str: String): String = ??? // your substring code goes here
val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a")))
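As a concrete sketch (the function name and logic here are illustrative, not from the original answer), a UDF that keeps the first three characters:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{udf, col}

// Hypothetical helper: take the first three characters; take is safe on short strings
def firstThree(str: String): String = str.take(3)
val firstThreeUdf = udf(firstThree _)

val df = Seq(("foobar", "foo")).toDF("a", "b")
df.withColumn("b", firstThreeUdf(col("a"))).show()
// b becomes "foo" for a = "foobar"
```

Prefer the built-in substring and substring_index functions when they fit: UDFs are opaque to the Catalyst optimizer, so they generally perform worse than the equivalent built-ins.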

