Skip to content

Data Architecture

Validated against PRD v1.0


1. Database Strategy

Single PostgreSQL Instance — Schema per Bounded Context

flowchart TB
    subgraph PG[PostgreSQL]
        ONB[(onboarding<br/>customer, individual<br/>legal_entity, ownership<br/>document)]
        SCR[(screening<br/>request, result<br/>adjudication)]
        RR[(riskrating<br/>assessment<br/>factor)]
        CM[(casemanagement<br/>case, note<br/>evidence, decision)]
        CFG[(configuration<br/>config_version<br/>workflow_template<br/>threshold_config)]
        NA[(networkanalysis<br/>entity_graph<br/>linked_entity)]
        NT[(notification)]
        AUD[(audit<br/>audit_event<br/>immutable, partitioned)]
        AUTH[(auth<br/>user, role, session)]
    end

Why schemas, not separate databases:

  • Single connection pool — no distributed transactions.
  • PostgreSQL schemas enforce logical separation without operational overhead.
  • Cross-schema queries allowed for audit/reporting but NOT for domain logic.
  • Easy extraction: when a module becomes a service, its schema becomes its database.

2. Schema Design Principles

2.1 Entity Identification

All entities use UUIDs as primary keys. Never auto-increment integers.

Why: UUIDs can be generated by the application (not the DB), survive data migrations, and enable future service extraction without key collisions.

CREATE TABLE onboarding.customer (
    customer_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_type VARCHAR(20) NOT NULL,  -- INDIVIDUAL, LEGAL_ENTITY
    ...
);

2.2 Timestamps

Every table includes:

created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()

updated_at is set by application code (not DB trigger) to ensure consistency with Temporal workflow timestamps.

2.3 Immutability for Critical Data

The following tables are append-only (no UPDATE, no DELETE):

  • audit.audit_event
  • casemanagement.decision
  • screening.adjudication
  • riskrating.risk_assessment
  • configuration.config_version

If a risk rating is recalculated, a NEW row is inserted. The old row is preserved. This is NOT event sourcing for everything — it's append-only for audit-critical data only.

2.4 JSONB for Flexible Data

Configuration data and workflow context use JSONB:

  • configuration.workflow_template.definition — the workflow definition (states, transitions, conditions)
  • configuration.threshold_config.rules — risk scoring rules
  • onboarding.ownership_relationship.attributes — jurisdiction-specific ownership fields

Why JSONB, not normalized tables: Configurations are versioned blobs that change shape as rules evolve. JSONB allows the config engine to be schema-flexible without migrations. Index with GIN for query performance.


3. Key Table Designs

3.1 Audit Event (Immutable, Append-Only)

CREATE TABLE audit.audit_event (
    event_id        BIGSERIAL PRIMARY KEY,       -- sequential, NOT UUID (for ordering)
    correlation_id  UUID NOT NULL,               -- links events across a single request
    event_type      VARCHAR(50) NOT NULL,         -- STATE_TRANSITION, MODULE_INVOCATION, DECISION, CONFIG_CHANGE, DATA_ACCESS, AUTH
    actor_type      VARCHAR(20) NOT NULL,         -- USER, SYSTEM
    actor_id        UUID,
    context_type    VARCHAR(50) NOT NULL,         -- CUSTOMER, CASE, WORKFLOW
    context_id      UUID NOT NULL,
    action          VARCHAR(200) NOT NULL,        -- e.g., "State transition: ANALYST_REVIEW → EDD_REVIEW"
    payload         JSONB NOT NULL,               -- full event data
    config_version  VARCHAR(50),                  -- config version active at time of event
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Indexes for audit queries
CREATE INDEX idx_audit_context ON audit.audit_event (context_type, context_id, created_at);
CREATE INDEX idx_audit_correlation ON audit.audit_event (correlation_id);
CREATE INDEX idx_audit_type ON audit.audit_event (event_type, created_at);
CREATE INDEX idx_audit_payload ON audit.audit_event USING GIN (payload);

Retention: Partition by created_at (monthly). After 7 years, detach partition and archive to cold storage (NFR-C01).

No UPDATE, no DELETE ever. The event_id is a BIGSERIAL for sequential ordering within a partition. UUIDs are used for cross-system correlation but sequential IDs are needed for "show me what happened next" queries.

3.2 Customer (Core Entity)

CREATE TABLE onboarding.customer (
    customer_id     UUID PRIMARY KEY,
    customer_type   VARCHAR(20) NOT NULL,       -- INDIVIDUAL, LEGAL_ENTITY
    status          VARCHAR(30) NOT NULL,        -- ACTIVE, SUSPENDED, CLOSED, PROHIBITED
    risk_band       VARCHAR(10),                 -- LOW, MEDIUM, HIGH (denormalized from riskrating)
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE onboarding.individual (
    individual_id       UUID PRIMARY KEY REFERENCES onboarding.customer(customer_id),
    first_name          VARCHAR(100) NOT NULL,
    last_name           VARCHAR(100) NOT NULL,
    aliases             JSONB DEFAULT '[]',
    date_of_birth       DATE,
    nationality         VARCHAR(3),              -- ISO 3166 alpha-3
    residence_country   VARCHAR(3),
    tax_residency       VARCHAR(3),
    pep_flag            BOOLEAN DEFAULT FALSE,
    pep_level           VARCHAR(20),            -- NATIONAL, INTERNATIONAL, CLOSE_ASSOCIATE
    encrypted_fields    BYTEA                    -- encrypted PII: DOB, ID numbers (AES-256 at app layer)
);

CREATE TABLE onboarding.legal_entity (
    entity_id               UUID PRIMARY KEY REFERENCES onboarding.customer(customer_id),
    legal_name              VARCHAR(200) NOT NULL,
    trade_name              VARCHAR(200),
    registration_number     VARCHAR(100) NOT NULL,
    incorporation_country   VARCHAR(3),
    legal_form              VARCHAR(50),
    industry_code           VARCHAR(20),
    incorporation_date      DATE
);

3.3 Ownership (Recursive / Graph)

CREATE TABLE onboarding.ownership_relationship (
    relationship_id     UUID PRIMARY KEY,
    parent_entity_id    UUID NOT NULL REFERENCES onboarding.customer(customer_id),
    child_entity_id     UUID NOT NULL REFERENCES onboarding.customer(customer_id),
    ownership_pct       NUMERIC(5,2) NOT NULL CHECK (ownership_pct > 0 AND ownership_pct <= 100),
    control_type        VARCHAR(20) NOT NULL,   -- DIRECT, INDIRECT, BENEFICIAL
    effective_from      DATE,
    effective_to        DATE,
    confidence_score    DOUBLE PRECISION,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_ownership_parent ON onboarding.ownership_relationship (parent_entity_id);
CREATE INDEX idx_ownership_child ON onboarding.ownership_relationship (child_entity_id);

UBO Traversal (Recursive CTE):

WITH RECURSIVE ownership_chain AS (
    -- Base: direct owners of the target customer
    SELECT parent_entity_id, child_entity_id, ownership_pct, 1 AS depth
    FROM onboarding.ownership_relationship
    WHERE child_entity_id = :target_customer_id
      AND effective_to IS NULL

    UNION ALL

    -- Recursive: owners of owners
    SELECT o.parent_entity_id, o.child_entity_id, o.ownership_pct, oc.depth + 1
    FROM onboarding.ownership_relationship o
    JOIN ownership_chain oc ON o.child_entity_id = oc.parent_entity_id
    WHERE o.effective_to IS NULL
      AND oc.depth < 10  -- max 10 levels
)
SELECT parent_entity_id, SUM(ownership_pct) as total_ownership
FROM ownership_chain
GROUP BY parent_entity_id
HAVING SUM(ownership_pct) >= 25  -- UBO threshold
ORDER BY total_ownership DESC;

This handles basic relationship traversal. When advanced graph analytics are needed (v2+), add Neo4j with a sync from PostgreSQL ownership relationships to Neo4j nodes/edges.

3.4 Configuration (Versioned, Immutable)

CREATE TABLE configuration.config_version (
    version_id      UUID PRIMARY KEY,
    version_number  VARCHAR(20) NOT NULL,
    status          VARCHAR(20) NOT NULL,        -- DRAFT, TEST, ACTIVE, SUPERSEDED
    created_by      UUID NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    activated_at    TIMESTAMPTZ
);

CREATE TABLE configuration.workflow_template (
    template_id     UUID PRIMARY KEY,
    version_id      UUID NOT NULL REFERENCES configuration.config_version(version_id),
    template_name   VARCHAR(100) NOT NULL,
    definition      JSONB NOT NULL,              -- full workflow definition
    UNIQUE (version_id, template_name)
);

CREATE TABLE configuration.threshold_config (
    config_id       UUID PRIMARY KEY,
    version_id      UUID NOT NULL REFERENCES configuration.config_version(version_id),
    domain          VARCHAR(50) NOT NULL,        -- RISK_RATING, SCREENING
    rules           JSONB NOT NULL,              -- threshold rules
    UNIQUE (version_id, domain)
);

Querying active config: SELECT * FROM configuration.config_version WHERE status = 'ACTIVE' — at most one row.

3.5 Case Management

CREATE TABLE casemanagement.case (
    case_id             UUID PRIMARY KEY,
    case_type           VARCHAR(50) NOT NULL,
    state               VARCHAR(30) NOT NULL,
    priority            VARCHAR(10) NOT NULL DEFAULT 'MEDIUM',
    customer_id         UUID NOT NULL,
    workflow_instance_id UUID,
    assigned_to         UUID,
    escalation_level    INT NOT NULL DEFAULT 0,
    sla_deadline        TIMESTAMPTZ,
    resolved_at         TIMESTAMPTZ,
    resolution          VARCHAR(30),
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Append-only notes
CREATE TABLE casemanagement.case_note (
    note_id     UUID PRIMARY KEY,
    case_id     UUID NOT NULL REFERENCES casemanagement.case(case_id),
    author_id   UUID NOT NULL,
    content     TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    -- No updated_at — notes are immutable
);

-- Decisions are immutable
CREATE TABLE casemanagement.decision (
    decision_id             UUID PRIMARY KEY,
    case_id                 UUID NOT NULL REFERENCES casemanagement.case(case_id),
    decision_type           VARCHAR(30) NOT NULL,
    actor_type              VARCHAR(10) NOT NULL,
    actor_id                UUID,
    rationale               TEXT NOT NULL,
    evidence_refs           UUID[] DEFAULT '{}',
    config_version          VARCHAR(50),
    override_justification  TEXT,
    created_at              TIMESTAMPTZ NOT NULL DEFAULT now()
    -- Immutable — no updated_at
);

4. Data Ownership

Schema Owned By Accessed By (Read) Accessed By (Write)
onboarding Onboarding Context Screening, Risk Rating, Case Mgmt (via API, not SQL) Onboarding Context only
screening Screening Context Audit Screening Context only
riskrating Risk Rating Context Audit Risk Rating Context only
casemanagement Case Management Context Audit, Notifications Case Management Context only
configuration Configuration Engine ALL modules (read config version) Configuration Engine only
networkanalysis Network Analysis Context Audit Network Analysis Context only
notification Notification Context None Notification Context only
audit Audit Context ALL (read for replay) ALL contexts (append events)
auth Identity Context ALL (check RBAC) Identity Context only

Rule: No bounded context reads another context's data directly via SQL. Cross-context data access goes through API calls (in monolith: through interface contracts).

Exception: Audit schema is read by all for audit replay. Configuration schema is read by all for config resolution. Both are through shared read-only interfaces, not direct SQL.


5. State Persistence vs. Event Sourcing

Current State (for most entities)

  • Customer, Case, Document, Ownership — standard CRUD. The current state is the source of truth.
  • updated_at tracks when the state last changed.

Append-Only (for audit-critical data)

  • AuditEvent, Decision, Adjudication, RiskAssessment, ConfigVersion — immutable. New versions create new rows.
  • This is NOT full event sourcing. It's append-only state. The "event" is "a new state was recorded."

Event Sourcing (Post-MVP Consideration)

Full event sourcing (where the event log IS the source of truth and state is a projection) is viable for the Audit context but unnecessary complexity for MVP. The audit table already captures every event. If event sourcing is needed later, replay from audit.audit_event.


6. Document / BLOB Storage

MVP: PostgreSQL BYTEA + Metadata Table

CREATE TABLE onboarding.document (
    document_id         UUID PRIMARY KEY,
    customer_id         UUID NOT NULL REFERENCES onboarding.customer(customer_id),
    document_type       VARCHAR(50) NOT NULL,
    file_name           VARCHAR(255) NOT NULL,
    content_type        VARCHAR(100) NOT NULL,
    content             BYTEA NOT NULL,           -- document bytes
    content_hash        VARCHAR(64) NOT NULL,     -- SHA-256 for integrity
    issue_date          DATE,
    expiry_date         DATE,
    validation_status   VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_document_customer ON onboarding.document (customer_id);
CREATE INDEX idx_document_status ON onboarding.document (validation_status);

Size limit: 10MB per document (enforced in application). For larger documents or scale, migrate to object storage (S3/MinIO) — change storage layer without changing domain logic (the storage_reference field replaces content BYTEA).


7. Migration Strategy

Tool: Flyway

-- V1__create_onboarding_schema.sql
CREATE SCHEMA IF NOT EXISTS onboarding;

-- V2__create_customer_table.sql
CREATE TABLE onboarding.customer (...);

-- V3__create_audit_schema.sql
CREATE SCHEMA IF NOT EXISTS audit;

-- ...

Rules:

  • Migrations are sequential and immutable (never modify a committed migration).
  • Each bounded context has its own migration directory (under src/main/resources/db/migration/<context>/).
  • CI pipeline runs migrations against a test database before deployment.

8. Indexing Strategy

Table Index Purpose
audit.audit_event (context_type, context_id, created_at) Audit replay by case/customer
audit.audit_event (correlation_id) Trace a single request
audit.audit_event (event_type, created_at) Filter by event type
casemanagement.case (assigned_to, state) Analyst queue
casemanagement.case (state, sla_deadline) SLA monitoring
casemanagement.case (customer_id) Customer case history
onboarding.customer (status) Active customer list
onboarding.ownership_relationship (child_entity_id) UBO traversal
onboarding.document (customer_id, validation_status) Document status
screening.screening_result (match_status, created_at) Pending adjudication queue
configuration.config_version (status) Active config lookup
configuration.workflow_template (version_id, template_name) Template resolution

9. Encryption

At Rest

PostgreSQL TDE (Transparent Data Encryption) or filesystem-level encryption. Managed by deployment environment.

At Application Level (for PII)

Sensitive fields encrypted before storage:

@Entity
class Individual {
    @Column(columnDefinition = "BYTEA")
    var encryptedFields: ByteArray? = null  // AES-256-GCM encrypted JSON of {dob, idNumbers, ...}
}

What gets encrypted: date of birth, identification numbers, residence address details.
What does not: names (needed for screening), nationality (needed for risk rating), PEP flags.
Key management: Encryption keys in environment variables or vault (HashiCorp Vault post-MVP).


Data architecture designed for PRD v1.0 MVP (PostgreSQL, schema-per-context). Re-evaluate if graph or search requirements exceed PostgreSQL capabilities.