Backfill migrations

Goal

Apply a schema change AND populate the new columns from existing data — safely, idempotently, and in chunks for large tables.

The simple case

When the table is small (< 100k rows), a one-shot UPDATE is fine:

python
# my_plugin/migrations/005_normalize_panel_names.py
from mint_sdk.migrations import MigrationOps, PluginMigration


class Migration(PluginMigration):
    revision = "005"
    description = "add normalized_name and backfill from name"

    async def upgrade(self, ops: MigrationOps) -> None:
        await ops.add_column(
            "panels",
            ops.column("normalized_name", "text", nullable=True),
        )
        await ops.execute(
            "UPDATE panels SET normalized_name = LOWER(name) "
            "WHERE normalized_name IS NULL"
        )
        await ops.create_index("idx_panels_normalized", "panels", ["normalized_name"])

The WHERE normalized_name IS NULL makes it idempotent — re-running on a partial application picks up only the rows that didn't update last time.
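The inverse is straightforward if your PluginMigration exposes a downgrade hook (assumed here; only upgrade appears on this page, and ops.drop_column is assumed to mirror add_column). A sketch of a method you could add to the Migration class above:

python
    async def downgrade(self, ops: MigrationOps) -> None:
        # Undo in reverse order of upgrade(): index first, then column.
        await ops.execute("DROP INDEX idx_panels_normalized")
        await ops.drop_column("panels", "normalized_name")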

Chunked backfill for large tables

A single UPDATE across millions of rows holds one long transaction and can lock the table. Chunk it:

python
# my_plugin/migrations/006_backfill_panel_dose_units.py
from mint_sdk.migrations import MigrationOps, PluginMigration


CHUNK_SIZE = 5_000


class Migration(PluginMigration):
    revision = "006"
    description = "backfill dose_units = 'uM' on legacy panels"

    async def upgrade(self, ops: MigrationOps) -> None:
        # Schema change first
        await ops.add_column(
            "panels",
            ops.column("dose_units", "text", nullable=True),
        )

        # Chunked backfill
        while True:
            result = await ops.execute(
                """
                WITH batch AS (
                    SELECT id FROM panels
                    WHERE dose_units IS NULL
                    LIMIT :limit
                    FOR UPDATE SKIP LOCKED
                )
                UPDATE panels
                SET dose_units = 'uM'
                WHERE id IN (SELECT id FROM batch)
                RETURNING id
                """,
                params={"limit": CHUNK_SIZE},
            )
            rows = result.fetchall() if hasattr(result, "fetchall") else result
            if not rows:
                break

        # Now that every row has dose_units, enforce NOT NULL + index
        await ops.execute("ALTER TABLE panels ALTER COLUMN dose_units SET NOT NULL")
        await ops.create_index("idx_panels_dose_units", "panels", ["dose_units"])

Key techniques:

  • FOR UPDATE SKIP LOCKED — never blocks on rows another transaction is touching. Postgres-only; SQLite skips the lock keyword harmlessly.
  • LIMIT :limit — bounds each transaction's row count.
  • RETURNING id — lets the loop know whether it did any work this iteration.
  • Tighten constraints last — apply NOT NULL and indexes after the data is correct, otherwise you risk failing on the first inconsistent row.

Splitting schema and data into separate revisions

For very large datasets, separate the schema change from the backfill so the schema step stays fast and the backfill can take its time:

006_add_dose_units_column.py        # add nullable column + ship a release
007_backfill_dose_units.py          # backfill in chunks; idempotent
008_make_dose_units_required.py     # NOT NULL + index, only after 007 has run everywhere

Each migration becomes a small, easy-to-review change. The plugin author can also run 007 manually outside startup if needed (e.g., during a maintenance window).
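As a sketch, the last revision in the split carries only the constraint tightening; 006 shrinks to just the add_column call, and 007 becomes the chunked loop from the previous section, unchanged:

python
# my_plugin/migrations/008_make_dose_units_required.py
from mint_sdk.migrations import MigrationOps, PluginMigration


class Migration(PluginMigration):
    revision = "008"
    description = "enforce NOT NULL on dose_units after the backfill"

    async def upgrade(self, ops: MigrationOps) -> None:
        # Safe only once 007 has run everywhere: a single remaining
        # NULL row makes this ALTER fail.
        await ops.execute("ALTER TABLE panels ALTER COLUMN dose_units SET NOT NULL")
        await ops.create_index("idx_panels_dose_units", "panels", ["dose_units"])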

Online backfill

If the plugin is actively writing while the backfill runs, your migration must coexist with the application:

| Phase | Application code | Migration |
| --- | --- | --- |
| 1: Add nullable column | Reads tolerate NULL; writes leave column NULL | add_column (fast, no lock) |
| 2: App starts double-writing | Writes fill the new column with the computed value | (no migration) |
| 3: Backfill old rows | (no change) | Chunked UPDATE |
| 4: Make column required | App ensures every code path writes the column | ALTER ... SET NOT NULL |
| 5: Drop the old column | Reads use the new column only | drop_column (use carefully) |

This is a 5-step migration; ship as 5 separate plugin releases or 5 separate revisions in one release. Each step is reversible.
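For phase 2, the application's write path starts filling the new column on every insert, so the phase-3 backfill only has to touch legacy rows. A minimal sketch, assuming a SQLAlchemy-style async session (the create_panel helper and its signature are hypothetical):

python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def create_panel(session: AsyncSession, name: str, dose_units: str = "uM") -> None:
    # Phase 2 double-write: new rows always carry dose_units,
    # leaving only pre-existing rows for the chunked backfill.
    await session.execute(
        text("INSERT INTO panels (name, dose_units) VALUES (:name, :units)"),
        {"name": name, "units": dose_units},
    )
    await session.commit()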

Testing chunked backfills

Use a temporary SQLite engine and drive MigrationRunner directly:

python
# tests/test_migrations.py
import importlib

import pytest
from sqlalchemy.ext.asyncio import create_async_engine
from mint_sdk.migrations import MigrationRunner

# Migration module names start with a digit, so a plain `from ... import`
# is a SyntaxError; load them via importlib instead.
CreatePanelsTable = importlib.import_module(
    "my_plugin.migrations.001_initial"
).Migration
NormalizePanelNames = importlib.import_module(
    "my_plugin.migrations.005_normalize_panel_names"
).Migration
BackfillDoseUnits = importlib.import_module(
    "my_plugin.migrations.006_backfill_dose_units"
).Migration


@pytest.mark.asyncio
async def test_006_handles_partial_application(tmp_path):
    engine = create_async_engine(f"sqlite+aiosqlite:///{tmp_path / 'test.db'}")
    runner = MigrationRunner(engine, plugin_name="my_plugin", dialect="sqlite")

    # Apply up to 005
    result = await runner.run([CreatePanelsTable(), NormalizePanelNames()])
    assert result.applied == ["001", "005"]

    # Insert ~12,500 rows directly via SQL
    # (helper omitted for brevity)

    # Apply 006 — backfill kicks in
    result = await runner.run([
        CreatePanelsTable(),
        NormalizePanelNames(),
        BackfillDoseUnits(),
    ])
    assert result.applied == ["006"]
    assert not result.errors
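One possible shape for the omitted seed helper (hypothetical; it bulk-inserts rows through the engine so revision 006 has several chunks to work through):

python
from sqlalchemy import text


async def seed_panels(engine, n: int = 12_500) -> None:
    # 12,500 rows / 5,000 per chunk forces at least three backfill
    # iterations, which is what the test wants to exercise.
    async with engine.begin() as conn:
        await conn.execute(
            text("INSERT INTO panels (name) VALUES (:name)"),
            [{"name": f"panel-{i}"} for i in range(n)],
        )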

For SQLite-backed tests, FOR UPDATE SKIP LOCKED is silently ignored — the test still verifies correctness, just without proving the locking semantics under contention.

Notes

  • Backfills inside a single PluginMigration.upgrade run in one transaction by default. Long-running ones can blow the lock timeout — consider opening a fresh transaction per chunk via await ops.execute("COMMIT; BEGIN;") if your pattern allows it (sketched after this list).
  • ops.execute returns whatever SQLAlchemy returns — a Result for queries, a CursorResult for DML. Check the Result API docs for the SQLAlchemy version mint-sdk ships against.
  • For backfills that depend on application-level logic (e.g., complex computed values), consider a separate background task instead of an in-migration loop. Migrations should focus on schema; complex data work belongs in the application.
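A minimal sketch of the first note's per-chunk commit, assuming ops.execute forwards raw statements to the driver unchanged (CHUNKED_UPDATE_SQL stands in for the WITH batch ... UPDATE ... RETURNING id statement from revision 006):

python
# End each chunk's transaction so locks are held only for
# CHUNK_SIZE rows at a time, not for the whole backfill.
while True:
    result = await ops.execute(CHUNKED_UPDATE_SQL, params={"limit": CHUNK_SIZE})
    rows = result.fetchall() if hasattr(result, "fetchall") else result
    if not rows:
        break
    await ops.execute("COMMIT; BEGIN;")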

MINT is open source. Made by the Morscher Lab.