
Pipeline Lineage

Moose Lineage Manifest

The lineage manifest is a static description of a pipeline's systems and data flow. It lives under the implementation directory at moose/lineage.manifest.json (preferred) or lineage/manifest.json.

For full details on entity and relationship types, see the sections below.

Entities (Nodes)

Each node has: id, type, name, namespace, version, attrs.

  • Allowed types: connector, ingest_api, stream, dlq, transform, sync, table, materialized_view, external_table, consumption_api, openapi_spec, client, workflow
  • Required attrs per type (minimal; see the node sketches after this list):
    • connector: { mode: "webhook"|"etl"|"cdc", schema_hash }
    • ingest_api: { route, version, auth: { method: "jwt"|"api_key"|"none", audience?: string } }
    • stream: { partitions, retention_seconds }
    • dlq: { backing: "stream"|"table" }
    • transform: { code_ref: { repo, path, commit, line? }, dlq?: nodeId }
    • sync: { semantics: "at_least_once", flush: { rows?, interval_ms? }, offset_tracking: true }
    • table: { physical_name, engine, order_by, deduplicate?: boolean }
    • materialized_view: { target_table, select_from: string[] }
    • external_table: { provider: "clickpipes"|"debezium"|"aws_dms", lifecycle: "externally_managed" }
    • consumption_api: { route, query_spec: { params_schema_ref, tables_referenced: string[] }, auth }
    • openapi_spec: { path: ".moose/openapi.yaml" }
    • client: { kind: "dashboard"|"service"|"agent", sdk?: { language, version } }
    • workflow: { kind: "workflow"|"task", schedule?: string }
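
For illustration, a stream node and a table node carrying their minimal required attrs might look like the following sketch (ids, names, and attr values are placeholders):

{
  "id": "events-stream",
  "type": "stream",
  "name": "Events Stream",
  "namespace": "pipeline",
  "version": "1.0.0",
  "attrs": { "partitions": 1, "retention_seconds": 604800 }
}

{
  "id": "events-table",
  "type": "table",
  "name": "Events",
  "namespace": "clickhouse",
  "version": "1.0.0",
  "attrs": {
    "physical_name": "events",
    "engine": "MergeTree",
    "order_by": "timestamp"
  }
}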

Recommended extras for connector nodes (see the sketch after this list):

  • connector: { name, version?, author?, language?, implementation? }
  • identifier (e.g., GA4 property properties/1234)
  • schema_path (repo-relative path to a relevant schema file)
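
A connector node carrying these extras might be shaped like the sketch below. Nesting the extras under attrs is one plausible layout, and the schema_hash and schema_path values are placeholders:

{
  "id": "ga-connector",
  "type": "connector",
  "name": "Google Analytics",
  "namespace": "google-analytics",
  "version": "v4",
  "attrs": {
    "mode": "etl",
    "schema_hash": "<sha256>",
    "connector": {
      "name": "google-analytics",
      "version": "v4",
      "author": "514-labs",
      "language": "typescript"
    },
    "identifier": "properties/1234",
    "schema_path": "schemas/events.json"
  }
}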

Relationships (Edges)

Each edge has: from, to, type, attrs.

  • Allowed types: produces, publishes, dead_letters_to, transforms, emits, syncs_to, writes, derives, reads, queries, serves, documents, triggers, backfills, retries_from
  • Common edge attrs (optional; see the example after this list):
    • schema_from_hash, schema_to_hash
    • privacy_tags: string[] (e.g., ["pii_email","pii_phone"])
    • policy: { retention_days?, encryption?: "at_rest"|"none" }
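
A single edge with the optional attrs might look like this sketch (the ids match the node sketches above; the edge type, hashes, and tag values are illustrative):

{
  "from": "events-stream",
  "to": "events-table",
  "type": "writes",
  "attrs": {
    "schema_from_hash": "<sha256>",
    "schema_to_hash": "<sha256>",
    "privacy_tags": ["pii_email"],
    "policy": { "retention_days": 90, "encryption": "at_rest" }
  }
}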

Source of truth for types

Runtime types live in packages/models/src/lineage.ts. Keep docs and scaffolds in sync with these types.

Understanding Lineage

Data lineage tracks the flow of data through your pipeline, providing visibility into:

  • Where data originates (source systems)
  • How data is transformed
  • Where data lands (destination systems)
  • Dependencies between data elements

Lineage Manifest

Each pipeline includes a lineage manifest that describes the data flow:

{
  "nodes": [
    {
      "id": "ga-source",
      "type": "source",
      "name": "Google Analytics",
      "namespace": "google-analytics",
      "version": "v4"
    },
    {
      "id": "transform-1",
      "type": "transformation",
      "name": "Normalize Events",
      "namespace": "pipeline"
    },
    {
      "id": "clickhouse-dest",
      "type": "destination", 
      "name": "ClickHouse Analytics",
      "namespace": "clickhouse"
    }
  ],
  "edges": [
    {
      "from": "ga-source",
      "to": "transform-1",
      "type": "data_flow"
    },
    {
      "from": "transform-1",
      "to": "clickhouse-dest",
      "type": "data_flow"
    }
  ]
}

Generating Lineage Diagrams

Use the provided scripts to generate visual lineage diagrams:

# Generate Mermaid diagram
pnpm run generate:lineage:mermaid

# Generate SVG diagram
pnpm run generate:lineage:svg

# Generate interactive visualization
pnpm run generate:lineage:interactive
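
For the example manifest above, the generated Mermaid output might resemble the following sketch (the actual node shapes, labels, and layout depend on the script):

graph LR
  ga_source["Google Analytics"] --> transform_1["Normalize Events"]
  transform_1 --> clickhouse_dest["ClickHouse Analytics"]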

Lineage Schema References

Pipelines can reference connector schemas to build complete lineage:

{
  "datasets": [
    {
      "kind": "pointer",
      "name": "GA Events",
      "connector": {
        "name": "google-analytics",
        "version": "v4",
        "author": "514-labs",
        "language": "typescript"
      }
    }
  ]
}

Benefits of Lineage Tracking

  • Impact analysis - Understand downstream effects of changes
  • Debugging - Trace data issues to their source
  • Compliance - Document data flows for regulatory requirements
  • Documentation - Auto-generated visual documentation