Wilkinson formulas and Formulaic

Wilkinson formulas are a symbolic notation for expressing statistical models. They were introduced in Wilkinson and Rogers 1973, Symbolic Description of Factorial Models for Analysis of Variance https://www.jstor.org/stable/2346786 and have been extended over time for mixed and nested models. The notation is not standardized, and the individual implementations in R (formula), Matlab (documentation), and Python (patsy), among others, differ in the details.

My observation is that Wilkinson notation is much less common in the Python stats community than in R. One reason may be that the development of patsy stopped in 2018 and the package has been collecting open issues and deprecation warnings since then. I remember looking into using it for my work, but decided against it because of the lack of support. Luckily, there is now a designated successor called Formulaic. While Formulaic is still in beta, it has almost reached feature parity with patsy and can be used as a drop-in replacement in some use cases. For example, statsmodels is planning to adopt Formulaic.

import numpy as np
import pandas as pd
import formulaic
from formulaic import Formula

print(formulaic.__version__)
0.2.4

The notation

Wilkinson notation is best shown by example. With y ~ x1 + x2 + x3 we specify that y is modeled by a linear combination of the independent variables x1, x2, x3. Note that an intercept term is implicitly added.

Formula("y ~ x1 + x2 + x3")
y ~ 1 + x1 + x2 + x3

We can disable the intercept (e.g. if y is centered) by adding -1.

Formula("y ~ x1 + x2 + x3 - 1")
y ~ x1 + x2 + x3

Interactions between variables are denoted with the colon operator, e.g. x1:x2 means that an \(x_1 \cdot x_2\) term is added to the model. Adding all pairwise interactions can be done by writing them out, x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, or more conveniently using (x1 + x2 + x3)**2, where **2 denotes up to 2-way interactions, not a square.

Formula("y ~(x1 + x2 + x3)**2")
y ~ 1 + x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3

Similarly, three-way interactions can be added with (x1 + x2 + x3)**3, and individual terms can be removed with, e.g., - x1:x2.

Formula("y ~(x1 + x2 + x3)**3 - x1:x2")
y ~ 1 + x1 + x2 + x3 + x1:x3 + x2:x3 + x1:x2:x3

Powers and other transformations can be added inside an I(...). In Formulaic this can also be done with curly braces, e.g. {x1**2}.

Formula("y ~ (x1 + x2 + x3)**2 + {x1**2} + {x2**2} + {x3**2}")
y ~ 1 + x1 + x1**2 + x2 + x2**2 + x3 + x3**2 + x1:x2 + x1:x3 + x2:x3

Nested effects can be specified with a / b, which adds a term for a and all interactions of a with b. For example, x1 / (x2 + x3) crosses x1 with each of x2 and x3:

Formula("y ~ x1 / (x2 + x3)")
y ~ 1 + x1 + x1:x2 + x1:x3

If variable names contain special characters, they need to be quoted in backticks.

Formula("y ~ `x 1` + `x§` + `x#`")
y ~ 1 + x 1 + x# + x§

The full grammar implemented in Formulaic is listed here. Mixed-effects modelling using the | and || operators is not yet implemented.

What's it for?

By passing a dataframe to the formula we get the design matrix X. Here it gets really interesting, as categorical variables are automatically dummy-encoded and the formula is expanded correspondingly. In the dummy encoding, the first level ("A" in the example below) is always dropped.

df = pd.DataFrame({
    "y": [0, 1, 2],
    "x1": [6, 2, 5],
    "x2": [0.3, 0.1, 0.2],
    "x3": ["A", "B", "C"]
})
y, X = Formula("y ~ (x1 + x2 + x3)**2").get_model_matrix(df)
X
Intercept x1 x2 x3[T.B] x3[T.C] x1:x2 x3[T.B]:x1 x3[T.C]:x1 x3[T.B]:x2 x3[T.C]:x2
0 1.0 6 0.3 0 0 1.8 0 0 0.0 0.0
1 1.0 2 0.1 1 0 0.2 2 0 0.1 0.0
2 1.0 5 0.2 0 1 1.0 0 5 0.0 0.2

X and y are of type ModelMatrix but can be converted back to pd.DataFrames.

type(X)
formulaic.model_matrix.ModelMatrix
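
As a minimal sketch of that conversion, assuming the ModelMatrix wraps or behaves like a pandas DataFrame (which the type above suggests):

# assumed to work because ModelMatrix proxies the underlying pandas object
X_df = pd.DataFrame(X)
y_df = pd.DataFrame(y)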

We can retrieve the model_spec to expand other dataframes to the same design matrix. This is useful when a new dataframe does not contain all categorical levels.

df2 = pd.DataFrame({
    "x1": [4, 3, 5],
    "x2": [0.4, 0.3, 0.1],
    "x3": ["B", "B", "A"]
})

spec = X.model_spec
spec.get_model_matrix(df2)
Intercept x1 x2 x3[T.B] x3[T.C] x1:x2 x3[T.B]:x1 x3[T.C]:x1 x3[T.B]:x2 x3[T.C]:x2
0 1.0 4 0.4 1 0.0 1.6 4 0.0 0.4 0.0
1 1.0 3 0.3 1 0.0 0.9 3 0.0 0.3 0.0
2 1.0 5 0.1 0 0.0 0.5 0 0.0 0.0 0.0

An example

To get a feeling for formulaic in practice, we'll model the BaumgartnerAniline dataset from mopti.

import opti

problem = opti.problems.BaumgartnerAniline()
df = problem.data
problem
Problem(
name=Baumgartner 2019 - Aniline Cross-Coupling,
inputs=Parameters(
[Categorical(name='catalyst', domain=['tBuXPhos', 'tBuBrettPhos', 'AlPhos']),
 Categorical(name='base', domain=['TEA', 'TMG', 'BTMG', 'DBU']),
 Continuous(name='base_equivalents', domain=[1.0, 2.5]),
 Continuous(name='temperature', domain=[30, 100]),
 Continuous(name='residence_time', domain=[60, 1800])]
),
outputs=Parameters(
[Continuous(name='yield', domain=[0, 1])]
),
objectives=Objectives(
[Maximize('yield', target=0)]
),
data=
   catalyst  base  base_equivalents  temperature  residence_time     yield
0  tBuXPhos   DBU          2.183015         30.0      328.717802  0.042833
1  tBuXPhos  BTMG          2.190882        100.0       73.331194  0.959690
2  tBuXPhos   TMG          1.093138         47.5       75.121297  0.031579
3  tBuXPhos   TMG          2.186276        100.0      673.259508  0.766768
4  tBuXPhos   TEA          1.108767         30.0      107.541151  0.072299
)

Now we're ready to try out a number of models. We'll use a linear regressor from scikit-learn and evaluate using the leave-one-out cross-validation \(R^2\) score (\(Q^2\)). Since the regressor includes an intercept term by default, we'll remove the intercept from the design matrix. A linear model gives \(Q^2 = 0.73\) and \(R^2 = 0.78\), indicating non-linearities in the response.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import r2_score

formula = Formula("yield ~ catalyst + base + base_equivalents + temperature + residence_time - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature
R²=0.78 Q²=0.73

With all pairwise interaction terms this increases to \(Q^2 = 0.87\) and \(R^2 = 0.95\).

formula = Formula("yield ~ (catalyst + base + base_equivalents + temperature + residence_time)**2 - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature + base:base_equivalents + base:catalyst + base:residence_time + base:temperature + base_equivalents:catalyst + base_equivalents:residence_time + base_equivalents:temperature + catalyst:residence_time + catalyst:temperature + residence_time:temperature
R²=0.95 Q²=0.87

Adding 3-way interactions does not improve the fit: \(R^2\) rises to 0.99 while \(Q^2\) drops to 0.41, a clear sign of overfitting.

formula = Formula("yield ~ (catalyst + base + base_equivalents + temperature + residence_time)**3 - 1")
y, X = formula.get_model_matrix(df)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
q2 = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
print(f"{formula}\nR²={r2:.2f} Q²={q2:.2f}")
yield ~ base + base_equivalents + catalyst + residence_time + temperature + base:base_equivalents + base:catalyst + base:residence_time + base:temperature + base_equivalents:catalyst + base_equivalents:residence_time + base_equivalents:temperature + catalyst:residence_time + catalyst:temperature + residence_time:temperature + base:base_equivalents:catalyst + base:base_equivalents:residence_time + base:base_equivalents:temperature + base:catalyst:residence_time + base:catalyst:temperature + base:residence_time:temperature + base_equivalents:catalyst:residence_time + base_equivalents:catalyst:temperature + base_equivalents:residence_time:temperature + catalyst:residence_time:temperature
R²=0.99 Q²=0.41

Sure, this could also be done with one-hot encoding and polynomial feature expansion in scikit-learn. However, Formulaic is more expressive and prevents mistakes such as forgetting to drop a level in the one-hot encoding or taking squares of dummy variables.
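
Just as a rough sketch, the scikit-learn route might look like the following (my assumption of a reasonable setup, using the column names from the dataset above). Note that PolynomialFeatures also generates redundant products between dummy columns of the same factor, which is exactly the kind of term the formula interface avoids.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# one-hot encode the categoricals, dropping the first level of each factor
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first", sparse=False), ["catalyst", "base"])],
    remainder="passthrough",
)
# expand all pairwise interaction terms (no squares, no bias column)
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

features = make_pipeline(encode, expand)
X_sklearn = features.fit_transform(
    df[["catalyst", "base", "base_equivalents", "temperature", "residence_time"]]
)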

For comparison, this is what it looks like in statsmodels. Note that we have to rename the columns, as patsy doesn't like their names.

from statsmodels.formula.api import ols

df2 = df.copy()
df2.columns = ["x1", "x2", "x3", "x4", "x5", "y"]
model = ols("y ~ (x1 + x2 + x3 + x4 + x5)**2", data=df2).fit()
model.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.951
Model: OLS Adj. R-squared: 0.927
Method: Least Squares F-statistic: 38.44
Date: Sun, 07 Nov 2021 Prob (F-statistic): 1.38e-30
Time: 16:30:35 Log-Likelihood: 92.203
No. Observations: 96 AIC: -118.4
Df Residuals: 63 BIC: -33.78
Df Model: 32
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.2168 0.187 6.496 0.000 0.842 1.591
x1[T.tBuBrettPhos] -0.1101 0.150 -0.733 0.466 -0.410 0.190
x1[T.tBuXPhos] -0.3721 0.152 -2.444 0.017 -0.676 -0.068
x2[T.DBU] -0.5980 0.172 -3.478 0.001 -0.941 -0.254
x2[T.TEA] -0.6998 0.192 -3.636 0.001 -1.084 -0.315
x2[T.TMG] -1.3760 0.163 -8.432 0.000 -1.702 -1.050
x1[T.tBuBrettPhos]:x2[T.DBU] -0.2011 0.093 -2.159 0.035 -0.387 -0.015
x1[T.tBuXPhos]:x2[T.DBU] -0.1683 0.087 -1.943 0.056 -0.341 0.005
x1[T.tBuBrettPhos]:x2[T.TEA] -0.1892 0.101 -1.868 0.066 -0.392 0.013
x1[T.tBuXPhos]:x2[T.TEA] -0.0336 0.098 -0.344 0.732 -0.229 0.162
x1[T.tBuBrettPhos]:x2[T.TMG] -0.0267 0.087 -0.308 0.759 -0.200 0.147
x1[T.tBuXPhos]:x2[T.TMG] -0.1010 0.083 -1.224 0.226 -0.266 0.064
x3 -0.1204 0.096 -1.259 0.213 -0.311 0.071
x1[T.tBuBrettPhos]:x3 -0.0198 0.066 -0.301 0.765 -0.151 0.112
x1[T.tBuXPhos]:x3 0.0207 0.065 0.318 0.752 -0.109 0.151
x2[T.DBU]:x3 -0.0130 0.071 -0.183 0.855 -0.155 0.129
x2[T.TEA]:x3 0.0077 0.074 0.105 0.917 -0.140 0.155
x2[T.TMG]:x3 -0.0497 0.074 -0.668 0.507 -0.198 0.099
x4 -0.0056 0.002 -2.694 0.009 -0.010 -0.001
x1[T.tBuBrettPhos]:x4 0.0037 0.001 2.999 0.004 0.001 0.006
x1[T.tBuXPhos]:x4 0.0042 0.001 3.520 0.001 0.002 0.007
x2[T.DBU]:x4 0.0077 0.001 6.078 0.000 0.005 0.010
x2[T.TEA]:x4 -0.0005 0.001 -0.346 0.730 -0.003 0.002
x2[T.TMG]:x4 0.0123 0.001 10.457 0.000 0.010 0.015
x5 0.0002 0.000 1.414 0.162 -7.52e-05 0.000
x1[T.tBuBrettPhos]:x5 -9.09e-05 7.14e-05 -1.273 0.208 -0.000 5.18e-05
x1[T.tBuXPhos]:x5 3.655e-06 6.63e-05 0.055 0.956 -0.000 0.000
x2[T.DBU]:x5 1.023e-05 7.86e-05 0.130 0.897 -0.000 0.000
x2[T.TEA]:x5 -9.419e-05 7.73e-05 -1.219 0.227 -0.000 6.02e-05
x2[T.TMG]:x5 0.0002 6.87e-05 2.739 0.008 5.09e-05 0.000
x3:x4 0.0022 0.001 2.495 0.015 0.000 0.004
x3:x5 -3.465e-05 5.26e-05 -0.659 0.512 -0.000 7.04e-05
x4:x5 -1.93e-07 1.04e-06 -0.186 0.853 -2.27e-06 1.89e-06
Omnibus: 19.784 Durbin-Watson: 1.865
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34.781
Skew: -0.832 Prob(JB): 2.80e-08
Kurtosis: 5.434 Cond. No. 1.83e+06

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.83e+06. This might indicate that there are strong multicollinearity or other numerical problems.