Parsing Excel formulas in PySpark

This blog shows a couple of examples of how you can use the Spark Excel Library in PySaprk to parse Excel formulas:

Example 1

In this example, the code uses the “com.crealytics.spark.excel” library to read an Excel file into a DataFrame. Then it defines a function “parse_formulas” to filter out the rows in a given column that contain formulas (i.e. start with “=”) and returns those rows as a new DataFrame. Finally, it uses the function to parse formulas in column “A” of the Excel file and display the results.

Example 2

In this example, the df.withColumn("formulas", col("formula_column").rlike("^=[A-Z]+")) the statement creates a new column called “formulas” that contains only the formulas from the specified column, “formula_column”, using the rlike() function to match the regular expression “^=[A-Z]+”. The ^= is used to match the first character of the formula, which is “=” and the [A-Z]+ is used to match any uppercase letters that come after the “=”. The show() method at the end of the code snippet displays the result.

You can adjust this code to match your excel sheet and columns accordingly. Also, you can change the regular expression to match your formula if it has a different format.

Irtaza Hassan

irtaza@gmail.com

Parsing Excel formulas in PySpark

Example 1

Example 2

About the Author: Irtaza

Leave a Reply Cancel reply

Example 1

Example 2

You May Also Like

Analysing UK House Price Data with Spark, Athena and Tableau

About the Author: Irtaza

Leave a Reply Cancel reply