PySpark’s toDF() method converts an RDD to a DataFrame. Converting is usually worthwhile because a DataFrame is a distributed collection of data organized into named columns, much like a table in a relational database, and it benefits from Spark’s query optimization and generally better performance than a raw RDD.
To begin with, we can create a PySpark RDD by passing a Python list to the sparkContext.parallelize() function; this RDD object is used in all the examples below. Keep in mind that a Python list is a collection of data held in the PySpark driver’s memory, and it is only distributed across the cluster once we turn it into an RDD.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dept = [("Engineering", 50),("HR", 60),("Operations", 70),("IT", 80)]
rdd = spark.sparkContext.parallelize(dept)
rdd.collect()

This RDD holds the department names Engineering, HR, Operations, and IT with department numbers 50, 60, 70, and 80, respectively.
To convert a PySpark RDD to a DataFrame, there are two methods available: toDF() and createDataFrame(). Let’s discuss these methods in detail.
The first option is the toDF() function, a method that PySpark makes available on an RDD once a SparkSession exists; calling it converts the RDD into a DataFrame.
df = rdd.toDF()
df.printSchema()
df.show(truncate=False)

By default, toDF() names the columns “_1” and “_2” and infers their data types from the data. The resulting schema of this snippet is shown below.
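On the sample data above, printSchema() should produce output roughly like the following (Python integers are inferred as long):

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)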
toDF() also accepts column names as arguments, which lets you name the columns yourself instead of using the defaults. Here is an example of that signature:
deptColumns = ["dept_name","dept_id"]
df2 = rdd.toDF(deptColumns)
df2.printSchema()
df2.show(truncate=False)
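With the column names supplied, the schema and output should look roughly like this (data types are still inferred from the data):

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+-----------+-------+
|dept_name  |dept_id|
+-----------+-------+
|Engineering|50     |
|HR         |60     |
|Operations |70     |
|IT         |80     |
+-----------+-------+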
The second option is the createDataFrame() method of the SparkSession class, which also accepts an RDD as its input. With the same column list, it produces the same output as the toDF(deptColumns) example above.
deptDF = spark.createDataFrame(rdd, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
To provide a custom schema, i.e. to specify the column name, data type, and nullability of each field, use a StructType when creating the DataFrame with createDataFrame(). By default, column data types are inferred from the data and all columns are nullable. In the example below, dept_name is declared as a string and dept_id as an integer to match the data; refer to the PySpark documentation on StructType and StructField if you want to learn more about defining custom schemas.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# dept_id is declared as IntegerType because the sample values (50, 60, ...) are integers;
# declaring it as StringType would fail schema verification for this data.
deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', IntegerType(), True)
])

deptDF1 = spark.createDataFrame(rdd, schema=deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)
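With this explicit schema, the printed schema should look roughly like the following:

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: integer (nullable = true)

The complete example below combines all of the snippets covered so far.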

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dept = [("Engineering", 50),("HR", 60),("Operations", 70),("IT", 80)]
rdd = spark.sparkContext.parallelize(dept)
# Using toDF()
df = rdd.toDF()
df.printSchema()
df.show(truncate=False)
deptColumns = ["dept_name", "dept_id"]
df2 = rdd.toDF(deptColumns)
df2.printSchema()
df2.show(truncate=False)
# Using createDataFrame()
deptDF = spark.createDataFrame(rdd, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
# Using createDataFrame() with StructType schema
deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', IntegerType(), True)
])
deptDF1 = spark.createDataFrame(rdd, schema=deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)