Wilkinson formulas and Formulaic
Wilkinson formulas are a symbolic notation for expressing statistical models. They were introduced in Wilkinson and Rogers (1973), Symbolic Description of Factorial Models for Analysis of Variance (https://www.jstor.org/stable/2346786), and have been extended over time for mixed and nested models.
The notation is not standardized, and there are differences between the individual implementations in R (formula), Matlab, and Python (patsy), among others.
My observation is that Wilkinson notation is much less common in the Python stats community than in R. One reason may be that development of patsy stopped in 2018, and the package has been collecting open issues and deprecation warnings since. I remember looking into using it for my work, but decided against it because of the lack of support. Luckily, there is now a designated successor called Formulaic.

While Formulaic is still in beta, it has almost reached feature parity with patsy and can be used as a drop-in replacement in some use cases. For example, statsmodels is planning to adopt Formulaic.
import numpy as np
import pandas as pd
import formulaic
from formulaic import Formula
print(formulaic.__version__)
0.2.4
The notation
Wilkinson notation is best shown by example. With y ~ x1 + x2 + x3 we specify that y is modeled by a linear combination of the independent variables x1, x2, and x3. Note that an intercept term is implicitly added.
Formula("y ~ x1 + x2 + x3")
y ~ 1 + x1 + x2 + x3
We can disable the intercept (e.g. if y is centered) by adding -1.
Formula("y ~ x1 + x2 + x3 - 1")
y ~ x1 + x2 + x3
Interactions between variables are denoted with the colon operator, e.g. x1:x2 means that an $x_1 \cdot x_2$ term is added to the model. Adding all pairwise interactions can be done by writing them out, x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, or more conveniently using (x1 + x2 + x3)**2, where **2 means up to 2-way interactions, not squaring.
Formula("y ~(x1 + x2 + x3)**2")
y ~ 1 + x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
Similarly, three-way interactions can be added with (x1 + x2 + x3)**3, and individual terms can be removed with, e.g., - x1:x2.
Formula("y ~(x1 + x2 + x3)**3 - x1:x2")
y ~ 1 + x1 + x2 + x3 + x1:x3 + x2:x3 + x1:x2:x3
Powers and other transformations can be added inside an I(...). In Formulaic this can also be done with curly braces, e.g. {x1**2}.
Formula("y ~ (x1 + x2 + x3)**2 + {x1**2} + {x2**2} + {x3**2}")
y ~ 1 + x1 + x1**2 + x2 + x2**2 + x3 + x3**2 + x1:x2 + x1:x3 + x2:x3
Nested effects can be specified with a/b, which adds a term for a and all interactions of a with b. For example, x1 / (x2 + x3) crosses x1 with each of x2 and x3:
Formula("y ~ x1 / (x2 + x3)")
y ~ 1 + x1 + x1:x2 + x1:x3
If variable names contain special characters, they need to be quoted in backticks.
Formula("y ~ `x 1` + `x§` + `x#`")
y ~ 1 + x 1 + x# + x§
The full grammar implemented in Formulaic is listed here. Mixed effects modelling using the | and || operators is not yet implemented.
What's it for?
By passing a dataframe to the formula we get the design matrix X. This is where it gets really interesting: categorical variables are automatically dummy-encoded and the formula is expanded accordingly. In the dummy encoding, the first level ("A" in the example below) is always dropped.
df = pd.DataFrame({
    "y": [0, 1, 2],
    "x1": [6, 2, 5],
    "x2": [0.3, 0.1, 0.2],
    "x3": ["A", "B", "C"]
})
y, X = Formula("y ~ (x1 + x2 + x3)**2").get_model_matrix(df)
X
| | Intercept | x1 | x2 | x3[T.B] | x3[T.C] | x1:x2 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.B]:x2 | x3[T.C]:x2 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 6 | 0.3 | 0 | 0 | 1.8 | 0 | 0 | 0.0 | 0.0 |
1 | 1.0 | 2 | 0.1 | 1 | 0 | 0.2 | 2 | 0 | 0.1 | 0.0 |
2 | 1.0 | 5 | 0.2 | 0 | 1 | 1.0 | 0 | 5 | 0.0 | 0.2 |
X and y are of type ModelMatrix but can be converted back to pd.DataFrame objects.
type(X)
formulaic.model_matrix.ModelMatrix
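If a plain DataFrame is needed, passing the model matrix to the pandas constructor should do the trick. This is my own assumption rather than something shown in the Formulaic docs, relying on ModelMatrix behaving like the pandas object it wraps.
# convert the ModelMatrix back to a plain pandas DataFrame
# (assumes the matrix was materialized as pandas output, which is the default here)
X_df = pd.DataFrame(X)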
We can retrieve the model_spec to expand other dataframes into the same design matrix. This is useful when a new dataframe does not contain all categorical levels.
df2 = pd.DataFrame({
    "x1": [4, 3, 5],
    "x2": [0.4, 0.3, 0.1],
    "x3": ["B", "B", "A"]
})
spec = X.model_spec
spec.get_model_matrix(df2)
| | Intercept | x1 | x2 | x3[T.B] | x3[T.C] | x1:x2 | x3[T.B]:x1 | x3[T.C]:x1 | x3[T.B]:x2 | x3[T.C]:x2 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 4 | 0.4 | 1 | 0.0 | 1.6 | 4 | 0.0 | 0.4 | 0.0 |
1 | 1.0 | 3 | 0.3 | 1 | 0.0 | 0.9 | 3 | 0.0 | 0.3 | 0.0 |
2 | 1.0 | 5 | 0.1 | 0 | 0.0 | 0.5 | 0 | 0.0 | 0.0 | 0.0 |
# Example dataset: the "Baumgartner 2019 - Aniline Cross-Coupling" problem from the opti package
import opti

problem = opti.problems.BaumgartnerAniline()
df = problem.data
problem
Problem(
    name=Baumgartner 2019 - Aniline Cross-Coupling,
    inputs=Parameters([
        Categorical(name='catalyst', domain=['tBuXPhos', 'tBuBrettPhos', 'AlPhos']),
        Categorical(name='base', domain=['TEA', 'TMG', 'BTMG', 'DBU']),
        Continuous(name='base_equivalents', domain=[1.0, 2.5]),
        Continuous(name='temperature', domain=[30, 100]),
        Continuous(name='residence_time', domain=[60, 1800])
    ]),
    outputs=Parameters([Continuous(name='yield', domain=[0, 1])]),
    objectives=Objectives([Maximize('yield', target=0)]),
    data=
        catalyst  base  base_equivalents  temperature  residence_time     yield
    0   tBuXPhos   DBU          2.183015         30.0      328.717802  0.042833
    1   tBuXPhos  BTMG          2.190882        100.0       73.331194  0.959690
    2   tBuXPhos   TMG          1.093138         47.5       75.121297  0.031579
    3   tBuXPhos   TMG          2.186276        100.0      673.259508  0.766768
    4   tBuXPhos   TEA          1.108767         30.0      107.541151  0.072299
)
Now we're ready to try out a number of models. We'll use a linear regressor from scikit-learn and evaluate it with the leave-one-out cross-validated $R^2$ score ($Q^2$). Since the regressor includes an intercept term by default, we remove the intercept from the design matrix. A linear model gives $Q^2 = 0.73$ and $R^2 = 0.78$, indicating non-linearities in the response.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import r2_score
formula = Formula("yield ~ catalyst + base + base_equivalents + temperature + residence_time - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature
R²=0.78 Q²=0.73
With all pairwise interaction terms this increases to $Q^2 = 0.87$ and $R^2 = 0.95$.
formula = Formula("yield ~ (catalyst + base + base_equivalents + temperature + residence_time)**2 - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature + base:base_equivalents + base:catalyst + base:residence_time + base:temperature + base_equivalents:catalyst + base_equivalents:residence_time + base_equivalents:temperature + catalyst:residence_time + catalyst:temperature + residence_time:temperature
R²=0.95 Q²=0.87
Adding 3-way interactions does not improve the fit: $R^2$ rises to 0.99 but $Q^2$ drops to 0.41, a sign of overfitting.
formula = Formula("yield ~ (catalyst + base + base_equivalents + temperature + residence_time)**3 - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature + base:base_equivalents + base:catalyst + base:residence_time + base:temperature + base_equivalents:catalyst + base_equivalents:residence_time + base_equivalents:temperature + catalyst:residence_time + catalyst:temperature + residence_time:temperature + base:base_equivalents:catalyst + base:base_equivalents:residence_time + base:base_equivalents:temperature + base:catalyst:residence_time + base:catalyst:temperature + base:residence_time:temperature + base_equivalents:catalyst:residence_time + base_equivalents:catalyst:temperature + base_equivalents:residence_time:temperature + catalyst:residence_time:temperature
R²=0.99 Q²=0.41
Sure, this can also be done with one-hot encoding and polynomial feature expansion in scikit-learn. However, Formulaic is more expressive and prevents errors such as forgetting to drop a level in the one-hot encoding or taking squares of dummy variables.
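To illustrate, a rough scikit-learn equivalent of the two-way interaction model above might look like the following. This is a sketch: the column lists and the specific ColumnTransformer, OneHotEncoder and PolynomialFeatures settings are my own choices, not part of the original analysis.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

categoricals = ["catalyst", "base"]
numericals = ["base_equivalents", "temperature", "residence_time"]

pipe = make_pipeline(
    # one-hot encode the categoricals, remembering to drop the first level
    ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), categoricals),
        ("num", "passthrough", numericals),
    ]),
    # expand pairwise interactions; interaction_only=True avoids squared dummy columns
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
pipe.fit(df[categoricals + numericals], df["yield"])
Even so, PolynomialFeatures still produces interaction columns between dummies of the same categorical variable (products that are identically zero), which the formula-based design matrix never contains.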
For comparison, this is what the same model looks like in statsmodels. Note that we have to rename the columns, as patsy doesn't like their names.
from statsmodels.formula.api import ols
df2 = df.copy()
df2.columns = ["x1", "x2", "x3", "x4", "x5", "y"]
model = ols("y ~ (x1 + x2 + x3 + x4 + x5)**2", data=df2).fit()
model.summary()
Dep. Variable: | y | R-squared: | 0.951 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.927 |
Method: | Least Squares | F-statistic: | 38.44 |
Date: | Sun, 07 Nov 2021 | Prob (F-statistic): | 1.38e-30 |
Time: | 16:30:35 | Log-Likelihood: | 92.203 |
No. Observations: | 96 | AIC: | -118.4 |
Df Residuals: | 63 | BIC: | -33.78 |
Df Model: | 32 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 1.2168 | 0.187 | 6.496 | 0.000 | 0.842 | 1.591 |
x1[T.tBuBrettPhos] | -0.1101 | 0.150 | -0.733 | 0.466 | -0.410 | 0.190 |
x1[T.tBuXPhos] | -0.3721 | 0.152 | -2.444 | 0.017 | -0.676 | -0.068 |
x2[T.DBU] | -0.5980 | 0.172 | -3.478 | 0.001 | -0.941 | -0.254 |
x2[T.TEA] | -0.6998 | 0.192 | -3.636 | 0.001 | -1.084 | -0.315 |
x2[T.TMG] | -1.3760 | 0.163 | -8.432 | 0.000 | -1.702 | -1.050 |
x1[T.tBuBrettPhos]:x2[T.DBU] | -0.2011 | 0.093 | -2.159 | 0.035 | -0.387 | -0.015 |
x1[T.tBuXPhos]:x2[T.DBU] | -0.1683 | 0.087 | -1.943 | 0.056 | -0.341 | 0.005 |
x1[T.tBuBrettPhos]:x2[T.TEA] | -0.1892 | 0.101 | -1.868 | 0.066 | -0.392 | 0.013 |
x1[T.tBuXPhos]:x2[T.TEA] | -0.0336 | 0.098 | -0.344 | 0.732 | -0.229 | 0.162 |
x1[T.tBuBrettPhos]:x2[T.TMG] | -0.0267 | 0.087 | -0.308 | 0.759 | -0.200 | 0.147 |
x1[T.tBuXPhos]:x2[T.TMG] | -0.1010 | 0.083 | -1.224 | 0.226 | -0.266 | 0.064 |
x3 | -0.1204 | 0.096 | -1.259 | 0.213 | -0.311 | 0.071 |
x1[T.tBuBrettPhos]:x3 | -0.0198 | 0.066 | -0.301 | 0.765 | -0.151 | 0.112 |
x1[T.tBuXPhos]:x3 | 0.0207 | 0.065 | 0.318 | 0.752 | -0.109 | 0.151 |
x2[T.DBU]:x3 | -0.0130 | 0.071 | -0.183 | 0.855 | -0.155 | 0.129 |
x2[T.TEA]:x3 | 0.0077 | 0.074 | 0.105 | 0.917 | -0.140 | 0.155 |
x2[T.TMG]:x3 | -0.0497 | 0.074 | -0.668 | 0.507 | -0.198 | 0.099 |
x4 | -0.0056 | 0.002 | -2.694 | 0.009 | -0.010 | -0.001 |
x1[T.tBuBrettPhos]:x4 | 0.0037 | 0.001 | 2.999 | 0.004 | 0.001 | 0.006 |
x1[T.tBuXPhos]:x4 | 0.0042 | 0.001 | 3.520 | 0.001 | 0.002 | 0.007 |
x2[T.DBU]:x4 | 0.0077 | 0.001 | 6.078 | 0.000 | 0.005 | 0.010 |
x2[T.TEA]:x4 | -0.0005 | 0.001 | -0.346 | 0.730 | -0.003 | 0.002 |
x2[T.TMG]:x4 | 0.0123 | 0.001 | 10.457 | 0.000 | 0.010 | 0.015 |
x5 | 0.0002 | 0.000 | 1.414 | 0.162 | -7.52e-05 | 0.000 |
x1[T.tBuBrettPhos]:x5 | -9.09e-05 | 7.14e-05 | -1.273 | 0.208 | -0.000 | 5.18e-05 |
x1[T.tBuXPhos]:x5 | 3.655e-06 | 6.63e-05 | 0.055 | 0.956 | -0.000 | 0.000 |
x2[T.DBU]:x5 | 1.023e-05 | 7.86e-05 | 0.130 | 0.897 | -0.000 | 0.000 |
x2[T.TEA]:x5 | -9.419e-05 | 7.73e-05 | -1.219 | 0.227 | -0.000 | 6.02e-05 |
x2[T.TMG]:x5 | 0.0002 | 6.87e-05 | 2.739 | 0.008 | 5.09e-05 | 0.000 |
x3:x4 | 0.0022 | 0.001 | 2.495 | 0.015 | 0.000 | 0.004 |
x3:x5 | -3.465e-05 | 5.26e-05 | -0.659 | 0.512 | -0.000 | 7.04e-05 |
x4:x5 | -1.93e-07 | 1.04e-06 | -0.186 | 0.853 | -2.27e-06 | 1.89e-06 |
Omnibus: | 19.784 | Durbin-Watson: | 1.865 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 34.781 |
Skew: | -0.832 | Prob(JB): | 2.80e-08 |
Kurtosis: | 5.434 | Cond. No. | 1.83e+06 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.83e+06. This might indicate that there are
strong multicollinearity or other numerical problems.