Why replace null values in PySpark DataFrame?
In huge datasets there can be thousands of rows and hundreds of columns, and some of those columns may contain null or None values in more than one cell. Null values in PySpark are simply missing values in certain rows of String or Integer datatype columns; PySpark treats such blanks as null.
Handle null values in pyspark dataframe
Working with null or None values is a tedious job, so null values in a PySpark DataFrame should be handled with care; the same applies to None values present in a PySpark DataFrame. Left unhandled, they can generate erroneous results, which is why the data should be cleaned.
Remove null values from dataframe pyspark
The blunt way of dealing with null or no-value records is to remove them entirely. So how can we delete rows with null values from a PySpark DataFrame? With df.dropna() this is done smoothly: the entire row is removed. It may look like the easiest way to handle null/None, but the drawback is that you can lose records whose other columns contain valuable information.
How to replace null values in pyspark dataframe column?
Replace null with 0 in pyspark column
In most cases, the safe way is to replace the null values present in a PySpark DataFrame, most commonly with 0, occasionally with the mean value (if the column is numeric), or with a fixed string value. In fact, you can replace None/null with almost anything you like.
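The mean-value option mentioned above has no example later in the article, so here is a hedged sketch (dataset and app name are made up for illustration) that computes a numeric column's mean and fills the nulls with it:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# toy numeric dataset with a missing Points value
spark = SparkSession.builder.appName('MeanFillDemo').getOrCreate()
df = spark.createDataFrame([('Tim', 8), ('Steve', None), ('Jack', 4)],
                           ['FirstName', 'Points'])

# F.mean ignores nulls, so the mean here is computed over 8 and 4
mean_points = df.select(F.mean('Points')).first()[0]

# fill the null Points cells with the column mean
df = df.fillna(value=mean_points, subset=['Points'])
df.show()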
PySpark allows you to do all of the above. df.fillna() and df.na.fill() are two powerful PySpark functions that can do the job of replacing nulls.
While using df.fillna() and df.na.fill(), you need to take the schema into consideration. The datatype of the replacement value must match that of the respective column; if it does not, the null is left unaffected. So putting 0 where there is a null in a string column won't make any change.
df.fillna() and df.na.fill() can be used interchangeably: both functions have almost the same syntax and work in the same way.
Replacing NaN/None/null can also be accomplished with a PySpark UDF wrapped around a lambda one-liner. That's what you're going to explore here.
pyspark withcolumn udf lambda example to replace null values
A PySpark UDF with a lambda function can also be used to replace null values, and it is more robust: it replaces nulls irrespective of the column's datatype.
Replace/Remove null values from dataframe PySpark Example Code
By now you have become familiar with the different ways of handling null values. To improve that understanding, it's essential to get some hands-on practice; working through the PySpark code examples will show you how things actually work. So let's dive into some PySpark programs.
Code for pyspark dataframe used in this Example
#creating pyspark dataframe from python list containing some nulls
employee = [('Tim','Parker','Data Analyst','tid.0678308'),
            ('Stephen','Brown','Data Analyst',None),
            ('Steve','Jobs','Data Engineer','tid.5647382'),
            ('Jack','Downey','Platform Engineer','tid.0025637'),
            ('Adam','Jones','Data Scientist',None),
            ('Gwen','Willams','Data Engineer','tid.9875523')]

#defining column names
columns = ["FirstName","LastName","Title","TaskID"]

#importing SparkSession
from pyspark.sql import SparkSession

#creating a new appName 'Company' for our example
spark = SparkSession.builder.appName('Company').getOrCreate()

#assigning the employee list as data and the columns list as schema of our dataframe
df = spark.createDataFrame(data=employee, schema=columns)

#displaying dataframe
df.show()
Code for replacing null values with ‘N/A’ in ‘TaskID’ column
#replacing null values in String type column 'TaskID' with 'N/A'
df = df.fillna(value="N/A", subset=['TaskID'])
df.show()
#alternate syntax for replacing null
df = df.na.fill(value="N/A", subset=['TaskID'])
df.show()
Replace null values using pyspark udf and Lambda function
The following code can replace null values with any value; for this example we are putting 0 there. You can just as well use a string, double, or float value instead.
#importing functions as F
from pyspark.sql import functions as F

#this code replaces null with 0 in the 'TaskID' column
df = df.withColumn('TaskID', F.udf(lambda x: 0 if x is None else x)(F.col('TaskID')))
df.show()
Replace null values with 0 in pyspark dataframe integer column
The dataset and example below are completely different from the one above. Here the column we operate on contains only integer values.
#creating a pyspark dataframe with a column holding integer values
nums = [('Tim',8), ('Stephen',1), ('Steve',None), ('Jack',7), ('Adam',3), ('Gwen',None)]
columns = ["FirstName","Points"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NameNum').getOrCreate()
df = spark.createDataFrame(data=nums, schema=columns)

#removing nulls and putting 0s
df = df.fillna(value=0, subset=['Points'])

#displaying the dataframe
df.show()
In a nutshell, you learned how to handle null/NaN values present in a DataFrame by different means. You can also check out Apache Spark Question to challenge your Spark knowledge. Keep learning.
Code editor: Google Colab