I’ve had a rule of thumb for data modeling that unfortunately I forget on a regular basis, so I thought I’d write it down to maybe help myself actually use it more.
The rule of thumb is “always separate user intent from derived behavior/business logic”.
This rule is particularly applicable to things like state
or status
fields, which have a way of becoming “half-user controlled” and “half-system controlled”.
As the most recent example where I’ve run across this, consider a system of tasks for a stereotypical “tasks have predecessors/successor in a project plan”:
In terms of data model, we’ll focus on just a few things:
- There is a task
status
field that has three potential values:NotStarted
,InProgress
, andComplete
- The requirements state that the system handles all
NotStarted <-> InProgress
transitions, i.e. it “auto-starts” tasks once preceeding tasks areComplete
. - However only the user can say “this task is actually done (or not done)”.
An Okay Way
An initial attempt at modeling this is a single Task.status
field that is an enum of NotStarted
, InProgress
, and Complete
.
Then we use business logic to do “not rocket science but still somewhat nuanced” things like:
- Anytime a predecessor task changes maybe change the successor Task’s
status
, but only if it’s notComplete
- In the UI, treat
status = Complete
as “you checked complete” butstatus = NotStarted | InProgress
as “you didn’t check complete”
This is all fine and not that bad, but we end up with a “sometimes the field is written by X and sometimes it is written by Y”:
Which is not terrible, but generally more of a “business logic is hidden in susceptible-to-being-spaghetti ‘push’ code”.
I.e. it’s pretty common in this setup for, if the user unchecks “task is complete”, to forget to re-run the “ah right, set it back to the ‘based on predecessors’ value” logic.
A Better Way
Generally a cleaner way of modeling things is to strictly delineate user intent from derived behavior, i.e.:
- The user intent of “this is complete yes/no” is it’s own “thing” (database field)
- The calculated “potential status based on predecessors” logic is it’s own thing (derived field)
- The calculated combination of “status based on user intent or potential status based on predecessors” is it’s own thing (another derived field)
I.e. our data model would move from having a single status
field to:
Task.is_complete
is a boolean that is directly/always controlled by the user intent to mark “yes, this is/is not done”Task.status_based_on_predecessors
(probably not stored/persisted, so not a real column-in-the-db) does the calc of “this task should beInProgress
if all predecessors areComplete
, otherwiseNotStarted
”Task.status
still exists, but is now derived (although likely still persisted for simplicity of reads) by the calculation “ifis_complete
thenComplete
elsestatus_based_on_predecessors
” i.e.InProgress
orNotStarted
This moves the model to be more like a DAG of inputs with nodes of calculated values:
Which makes the application logic more functional, more reactive, rather than if
statements sprinkled in various places.
Granted, a separate but tempting tangent is that reactive / data flow paradigms have not generally taken hold on the server-side yet, especially at a “more than just lifecycle hooks within a single micro-service/monolith/ORM codebase” scale, so you still have to generally nudge/wire these derived values together, but I think the end result is still cleaner than the original “fuzzy ownership of a single field” approach.