Using Decorators to Solve Date Problems
A decorator is the gateway drug into the world of Python metaprogramming. In Python, everything, everything, is an object (most of them backed by a dictionary under the hood, but let’s not go there). That means we can pass any object into a function and return any object from one, regardless of its type, and that includes functions themselves.
If I define a function:
def fn(*args, **kwargs):
    pass
and now call `type` on `fn`:
type(fn)
function
the type is `function` (no surprises there). But remember, we can return anything. So if I really wanted to, I could do the following:
def parent(num):
    def firstchild():
        print('Hi I\'m the first child')

    def notfirstchild():
        print('Hi, I\'m the other child')

    if num == 1:
        return firstchild
    else:
        return notfirstchild
Now, if I call `parent`, it returns another function, and which one I get depends on the input:
f = parent(1)
f()
Hi I'm the first child
f = parent(2)
f()
Hi, I'm the other child
Note the output is a function, which I can call just like any other function!
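The same idea works in the other direction: a function can be passed in as an argument. Here is a minimal sketch of that; the names `shout`, `whisper` and `greet` are made up for illustration.

```python
def shout(text):
    return text.upper() + '!'

def whisper(text):
    return text.lower() + '...'

def greet(style, name):
    # `style` is just another argument; it happens to be callable
    return style('hello ' + name)

print(greet(shout, 'Ada'))    # HELLO ADA!
print(greet(whisper, 'Ada'))  # hello ada...
```

Passing functions in and getting functions back are the only two ingredients a decorator needs.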
Functions, Functions Everywhere
In the following, we take this functions-are-objects concept further. A function called `decorator` accepts another function as input. Inside this `decorator` function, another `wrapper` function is defined, whose responsibility is to call the function passed in to the decorator and add functionality around it. This is huge! It means we can attach extra behavior (such as logging) while preserving the original functionality, with little to no modification of the original function.
def decorator(func):
    def wrapper(*args, **kwargs):
        print('From the wrapper')
        # return the result so the decorated function's return value is not lost
        return func(*args, **kwargs)
    return wrapper

def some_function(*args, **kwargs):
    print('from the function')

decorated_function = decorator(some_function)
# without decoration
some_function()
from the function
# with decoration
decorated_function()
From the wrapper
from the function
Using some of Python’s “syntactic sugar”, as this RealPython article calls it, we can make the above much more compact:
@decorator
def some_function(*args, **kwargs):
    print('from the function')
some_function()
From the wrapper
from the function
And we achieve the same functionality!
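One thing the `@` form hides is that the decorated name now refers to the inner `wrapper`, so metadata such as `__name__` and `__doc__` changes too. The standard library's `functools.wraps` copies that metadata back onto the wrapper, which is why it shows up in the examples further down. A small sketch, with `plain_decorator`, `wrapping_decorator`, `first` and `second` as made-up names:

```python
import functools

def plain_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def wrapping_decorator(func):
    @functools.wraps(func)  # copies __name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@plain_decorator
def first():
    """Docstring of first."""

@wrapping_decorator
def second():
    """Docstring of second."""

print(first.__name__)   # wrapper
print(second.__name__)  # second
print(second.__doc__)   # Docstring of second.
```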
Because that Wasn’t Convoluted Enough
Okay, let’s add an additional step, and then I’ll walk through a real-world example I had to implement recently.
What if, in addition to arguments to the function, I want to pass arguments to the decorator? Let’s say I want a decorator which runs a given function multiple times, but I want to configure how many times the function is run depending on the function being decorated:
import functools

def decorator(num_times_to_run):
    def _decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            for _ in range(num_times_to_run):
                function(*args, **kwargs)
        return wrapper
    return _decorator
@decorator(num_times_to_run=2)
def function_one():
    print('from function one')
function_one()
from function one
from function one
@decorator(num_times_to_run=8)
def function_two():
    print('from function two')
function_two()
from function two
from function two
from function two
from function two
from function two
from function two
from function two
from function two
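It can help to see what a line like `@decorator(num_times_to_run=2)` expands to: one call that builds the real decorator, then a second call that applies it to the function. The sketch below spells out both steps by hand. It uses a hypothetical variant named `repeat` that also collects the return values (the version above discards them), and `roll` is a made-up function.

```python
import functools

def repeat(num_times_to_run):
    def _decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            # collect each call's return value instead of discarding it
            return [function(*args, **kwargs) for _ in range(num_times_to_run)]
        return wrapper
    return _decorator

def roll():
    return 4

# Equivalent to writing @repeat(num_times_to_run=3) above `def roll`
configured = repeat(num_times_to_run=3)  # step 1: build the decorator
roll = configured(roll)                  # step 2: apply it to the function

print(roll())  # [4, 4, 4]
```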
Here the decorator accepts some configuration that determines how many times the decorated function runs. This is a toy example, but the next section goes through an application which I actually found quite useful!
A Real-World Example
Imagine we have a series of functions designed to clean some set of data, and imagine that they have their set of individual arguments, depending on the function. The only common argument is a single dataframe within which any data-cleaning processes would be done:
def clean_strings(df, *args, **kwargs):
    # do string cleaning to df
    return df

def remove_stopwords(df, *args, **kwargs):
    # do stopword removal
    return df

def calculate_windows(df, *args, **kwargs):
    # calculate windows
    return df
(Note: this is a watered-down, simplified example for the sake of conveying the usefulness of the decorator.)
Now, imagine that the above functions may handle multiple dataframes, with multiple types of columns, one type of which may be dates. The issue arises when certain processing stages (such as calculation of windows) depend on the date columns but the date columns are formatted irregularly. For example:
Date Format | Pattern |
---|---|
Julian | yyyy/DDD |
American | dd/MM/yyyy |
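...and the list goes on, but you get the point. (Strictly speaking, `dd/MM/yyyy` is day-first, while the usual US convention is month-first, `MM/dd/yyyy`; `american` is simply the label used in the configuration below.)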
Now let’s say that I want to standardize the input to all my cleaning functions. Solution 1 would be to define some function `clean_dates` which takes in the dataframe, cleans the date columns specified by some configuration, and returns the cleaned dataframe.
I don’t like this approach for two reasons:
- I (or whoever uses my code) may completely forget to run my `clean_dates` function, and
- This approach adds additional lines that may take away from the overall story of my analysis (this is a personal preference, and I’m not saying either approach is objectively “better” than the other; using decorators just gives me the excuse to learn about new Python patterns as well as write neater, easier-to-use code)
Solving The Above using Decorators
Here’s what I ended up settling on:
import functools

date_cols = {
    'american': ['column_one'],
    'julian': ['column_two'],
    'inversejulian': ['column_three']
}

def datefixer(dateconf):
    import pyspark
    from pyspark.sql import functions as F

    def _datefixer(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            df_dateconf = {}
            for key, values in dateconf.items():
                df_dateconf[key] = [i for i in df.columns if i in values]

            for dateformat in df_dateconf.keys():
                for datecolumn in df_dateconf[dateformat]:
                    if dateformat == 'american':
                        df = df.withColumn(datecolumn, F.to_date(datecolumn, 'dd/MM/yyyy'))
                    if dateformat == 'julian':
                        df = df.withColumn(datecolumn, F.to_date(datecolumn, 'yyyy/DDD'))
                    if dateformat == 'inversejulian':
                        df = df.withColumn(datecolumn, F.to_date(datecolumn, 'DDD/yyyy'))
            return func(df, *args, **kwargs)
        return wrapper
    return _datefixer
The parent `datefixer` function takes a configuration (an example of which is given above): a dictionary mapping a date format to a list of (potential) column names which may exist in the dataframes.
These lines:
for key, values in dateconf.items():
    df_dateconf[key] = [i for i in df.columns if i in values]
create a mapping of date columns which exist in the dataframe. This allows me to have a single configuration regardless of the function being decorated.
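To make the filtering step concrete, here is a small standalone sketch of it. The `columns` list is a hypothetical stand-in for `df.columns`, so no Spark session is needed to try it.

```python
date_cols = {
    'american': ['column_one'],
    'julian': ['column_two'],
    'inversejulian': ['column_three'],
}

# Hypothetical stand-in for df.columns: this dataframe has no 'column_two'
columns = ['column_one', 'column_three', 'some_other_column']

df_dateconf = {}
for key, values in date_cols.items():
    # keep only the configured columns that this dataframe actually has
    df_dateconf[key] = [i for i in columns if i in values]

print(df_dateconf)
# {'american': ['column_one'], 'julian': [], 'inversejulian': ['column_three']}
```

Formats with no matching column end up with an empty list, so the conversion loop simply skips them.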
This section:
for dateformat in df_dateconf.keys():
    for datecolumn in df_dateconf[dateformat]:
        if dateformat == 'american':
            df = df.withColumn(datecolumn, F.to_date(datecolumn, 'dd/MM/yyyy'))
        if dateformat == 'julian':
            df = df.withColumn(datecolumn, F.to_date(datecolumn, 'yyyy/DDD'))
then takes the input dataframe and applies standard formatting depending on the type-name pairing specified in the configuration.
After this, I simply call the original function and return its result:
return func(df, *args, **kwargs)
with its initial set of arguments, but a fully-cleaned dataframe!
Testing the above decorator with (potential) data-cleaning functions:
@datefixer(dateconf=date_cols)
def clean_one(df):
    # do some cleaning
    return df

@datefixer(dateconf=date_cols)
def clean_two(df, *args, **kwargs):
    # do some other cleaning
    return df
# creating some dummy data
import pandas as pd
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName('decorators').getOrCreate()
data = pd.DataFrame({
    'column_one': ['06/07/2022'],
    'column_two': ['1997/310'],
    'column_three': ['310/1997'],
})
df = sc.createDataFrame(data)
# uncleaned
df.show()
+----------+----------+------------+
|column_one|column_two|column_three|
+----------+----------+------------+
|06/07/2022|  1997/310|    310/1997|
+----------+----------+------------+
# applying the decorated functions
clean_one(df).show()
+----------+----------+------------+
|column_one|column_two|column_three|
+----------+----------+------------+
|2022-07-06|1997-11-06|  1997-11-06|
+----------+----------+------------+
We can do the same with both `args` and `kwargs`!
clean_two(df, 23, a_keyword_argument=1).show()
+----------+----------+------------+
|column_one|column_two|column_three|
+----------+----------+------------+
|2022-07-06|1997-11-06|  1997-11-06|
+----------+----------+------------+
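If Spark isn't at hand, the same pattern can be tried with the standard library alone. The following is a rough analog rather than the code above: it assumes the "dataframe" is a plain list of dictionaries and swaps `F.to_date` for `datetime.strptime`. The `PATTERNS` mapping and `clean_rows` are names invented for this sketch.

```python
import functools
from datetime import datetime

# strptime equivalents of the Spark patterns; %j is day of year.
# 'american' keeps the article's label for the day-first dd/MM/yyyy pattern.
PATTERNS = {
    'american': '%d/%m/%Y',
    'julian': '%Y/%j',
    'inversejulian': '%j/%Y',
}

date_cols = {
    'american': ['column_one'],
    'julian': ['column_two'],
    'inversejulian': ['column_three'],
}

def datefixer(dateconf):
    def _datefixer(func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            fixed = []
            for row in rows:
                row = dict(row)  # copy so the caller's data is untouched
                for dateformat, columns in dateconf.items():
                    for column in columns:
                        if column in row:
                            parsed = datetime.strptime(row[column], PATTERNS[dateformat])
                            row[column] = parsed.date()
                fixed.append(row)
            return func(fixed, *args, **kwargs)
        return wrapper
    return _datefixer

@datefixer(dateconf=date_cols)
def clean_rows(rows):
    # hypothetical cleaning step; the dates are already parsed here
    return rows

rows = [{'column_one': '06/07/2022', 'column_two': '1997/310', 'column_three': '310/1997'}]
result = clean_rows(rows)
print({k: str(v) for k, v in result[0].items()})
# {'column_one': '2022-07-06', 'column_two': '1997-11-06', 'column_three': '1997-11-06'}
```

The structure is the same three layers as before: configuration, decorator, wrapper. Only the date-parsing call differs.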
In conclusion, the above uses decorators, an aspect of Python metaprogramming, to standardize data-processing in Python.