Caveats to state comparison

The state: selection method is a powerful feature, with a lot of underlying complexity. Below are a handful of considerations when setting up automated jobs that leverage state comparison.

Seeds

dbt stores a file hash for each seed file smaller than 1 MiB. If the contents of such a seed are modified, the seed will be included in state:modified.

If a seed file is larger than 1 MiB, dbt cannot compare its contents and will raise a warning to that effect. Instead, dbt uses only the seed's file path to detect changes: if the file path has changed, the seed is included in state:modified; if it hasn't, it isn't.
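The two comparison modes described above can be sketched in a few lines of Python. This is an illustration only, not dbt's actual implementation: the function names, the choice of SHA-256, and the handling of a file of exactly 1 MiB are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

ONE_MIB = 1024 * 1024  # size threshold described in the text


def seed_fingerprint(path: Path) -> tuple[str, str]:
    """Return a (kind, value) pair used to compare a seed across two states.

    Hypothetical helper: small seeds are fingerprinted by content,
    large seeds by file path only.
    """
    if path.stat().st_size < ONE_MIB:
        # Small seed: a hash of the contents, so any edit changes the value.
        return ("sha256", hashlib.sha256(path.read_bytes()).hexdigest())
    # Large seed: contents are not compared; only the path is recorded.
    return ("path", path.as_posix())


def seed_is_modified(old: tuple[str, str], new: tuple[str, str]) -> bool:
    """A seed counts as modified when its fingerprint differs between states."""
    return old != new
```

Under this sketch, editing a row in a large seed leaves its fingerprint unchanged, which is why such a seed is not picked up by state:modified unless its path changes.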

Macros

dbt will mark modified any resource that depends on a changed macro, or on a macro that depends on a changed macro.
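As a rough illustration of how a macro change can propagate, the sketch below walks a macro dependency graph and returns the resources that would be affected. It is not dbt's code: the function name and the dictionary shapes are hypothetical, and it assumes the propagation applies through any number of macro-to-macro hops, whereas the text above only states the one-hop case.

```python
from collections import deque


def modified_by_macro_change(changed_macros, macro_deps, resource_macros):
    """Return the resources affected by a set of changed macros.

    changed_macros:  set of macro names whose definitions changed
    macro_deps:      dict of macro name -> set of macros it calls
    resource_macros: dict of resource name -> set of macros it calls directly
    """
    # Invert macro_deps so we can walk from a changed macro to its callers.
    callers = {}
    for macro, deps in macro_deps.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(macro)

    # Breadth-first walk: collect every macro that reaches a changed macro.
    affected = set(changed_macros)
    queue = deque(changed_macros)
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)

    # A resource is affected if it calls any affected macro directly.
    return {name for name, macros in resource_macros.items() if macros & affected}
```

For example, if a model calls a macro named `cents_to_dollars`, and that macro calls a changed macro named `round_half_up` (both names invented here), the model is returned even though it never calls `round_half_up` itself.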

Vars

If a model uses a var or env_var in its definition, dbt cannot currently trace that lineage, so a change in the var or env_var value does not by itself cause the model to be included in state:modified. The model will likely be marked modified, however, if the changed value results in a different configuration.

Tests

The command dbt test -s state:modified will include both:

  • tests that select from a new/modified resource
  • tests that are themselves new or modified
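The two rules in the list above amount to a simple set check, sketched here in Python. The function and the data shapes are hypothetical and serve only to make the rule concrete; dbt's real selection logic is more involved.

```python
def selected_tests(tests, modified):
    """Return the tests that a state:modified selection would pick up.

    tests:    dict of test name -> set of resources the test selects from
    modified: set of names of new or modified nodes (resources and tests)
    """
    return {
        name
        for name, parents in tests.items()
        # Rule 2: the test itself is new or modified.
        # Rule 1: at least one resource it selects from is new or modified.
        if name in modified or parents & modified
    }
```

Note that a test with two parents is selected as soon as one of them is modified, even if the other is untouched.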

As long as you're adding or changing tests at the same time that you're adding or changing the resources (models, seeds, snapshots) they select from, all should work the way you expect with "simple" state selection:

dbt run -s "state:modified"
dbt test -s "state:modified"

This can get complicated, however. If you add a new test without modifying its underlying model, or add a test that selects from a new model and an old unmodified one, you may need to test a model without having first run it.

You can defer upstream references when testing. For example, if a test selects from a model that doesn't exist as a database object in your current environment, dbt will look to the other environment instead (the one defined in your state manifest). This enables you to use "simple" state selection without risk of query failure, but it may have some surprising consequences for tests with multiple parents. For instance, if you have a relationships test that depends on one modified model and one unmodified model, the test query will select from data "across" two different environments. If you limit or sample your data in development and CI, it may not make much sense to test for referential integrity, knowing there's a good chance of mismatch.
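The fallback behavior described above can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions, not how dbt resolves references internally: the function name, the dictionaries, and the schema names `dbt_ci` and `analytics` are all invented for the example.

```python
def resolve_relation(model, current_env_relations, state_manifest_relations):
    """Pick the relation a deferred test query would select from.

    current_env_relations:    dict of model name -> relation that exists
                              in the current environment
    state_manifest_relations: dict of model name -> relation recorded in
                              the state manifest (the other environment)
    """
    if model in current_env_relations:
        # The model was built in the current environment, so use it.
        return current_env_relations[model]
    # Otherwise fall back to the environment defined in the state manifest.
    return state_manifest_relations[model]


# A relationships test with one modified parent (built in CI) and one
# unmodified parent (never built in CI) ends up spanning two environments.
current = {"orders": "dbt_ci.orders"}
state = {"orders": "analytics.orders", "customers": "analytics.customers"}
parents = [resolve_relation(m, current, state) for m in ("orders", "customers")]
```

Here `parents` mixes a CI relation with a production one, which is the situation in which a referential-integrity check on sampled CI data is likely to report mismatches.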

If you're a frequent user of relationships tests or data tests, or frequently find yourself adding tests without modifying their underlying models, consider tweaking the selection criteria of your CI job. For instance:

dbt run -s "state:modified"
dbt test -s "state:modified" --exclude "test_name:relationships"

False positives

Final note

State comparison is complex. We hope to reach eventual consistency between all configuration options, and to give users the control they need to reliably return all modified resources, and only the ones they expect. If you're interested in learning more, read the open issues tagged "state" in the dbt repository.