Risk Register & Mitigation Strategies
Validated against PRD v1.0
Risk Matrix
| Risk | Likelihood | Impact | Score | Owner |
|---|---|---|---|---|
| R1: Temporal learning curve slows Phase 2 | Medium (4) | High (4) | 16 | Engineering Lead |
| R2: Modular monolith becomes tightly coupled | Medium (4) | Critical (5) | 20 | Tech Lead / CI |
| R3: External integrations unstable | Medium (3) | Medium (3) | 9 | Integration Lead |
| R4: Configuration engine complexity | Medium (4) | High (4) | 16 | Config Module Owner |
| R5: Corporate onboarding scope creep | High (5) | High (4) | 20 | Product Owner |
| R6: Team lacks FEC domain expertise | Medium (3) | Medium (3) | 9 | Domain Expert / PO |
| R7: PostgreSQL recursive queries slow at scale | Low (2) | Medium (3) | 6 | Data Architect |
| R8: Performance targets not met | Low (2) | High (4) | 8 | Performance Lead |
| R9: Regulatory requirements change during build | Low (2) | Critical (5) | 10 | Compliance Advisor |
| R10: Key person dependency (Temporal/Kotlin expertise) | Medium (3) | High (4) | 12 | Engineering Manager |
| R11: Audit log volume exceeds capacity | Low (2) | Medium (3) | 6 | Data Architect |
| R12: Security vulnerability discovered | Low (2) | Critical (5) | 10 | Security Lead |
Scale: Likelihood 1-5 (Rare→Almost Certain), Impact 1-5 (Negligible→Critical). Score = Likelihood × Impact.
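As a quick cross-check of the Score column, the arithmetic can be reproduced in a few lines. This is only a sketch: the `Risk` class and the `ranked` helper are hypothetical names introduced here, and the likelihood and impact numbers are copied from the matrix above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    rid: str
    likelihood: int  # 1-5, Rare -> Almost Certain
    impact: int      # 1-5, Negligible -> Critical

    @property
    def score(self) -> int:
        # Score = Likelihood x Impact, as defined by the register's scale.
        return self.likelihood * self.impact


# Likelihood and impact values copied from the risk matrix.
RISKS = [
    Risk("R1", 4, 4), Risk("R2", 4, 5), Risk("R3", 3, 3), Risk("R4", 4, 4),
    Risk("R5", 5, 4), Risk("R6", 3, 3), Risk("R7", 2, 3), Risk("R8", 2, 4),
    Risk("R9", 2, 5), Risk("R10", 3, 4), Risk("R11", 2, 3), Risk("R12", 2, 5),
]


def ranked(risks):
    """Return risks by descending score; ties keep their register order."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for r in ranked(RISKS):
        print(f"{r.rid}: {r.likelihood} x {r.impact} = {r.score}")
```

Sorting this way puts R2 and R5 (both 20) at the top, followed by R1 and R4 (both 16).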
Detailed Risks
R1: Temporal Learning Curve
Description: Temporal introduces concepts (workflows, activities, signals, versioning) that differ from traditional request-response programming. Team may need 2-3 weeks to become productive.
Impact: Phase 2 (Workflow Engine) could slip from 4 weeks to 6-7 weeks.
Mitigation:
- Phase 0 includes a "Hello World" Temporal PoC, so the team gets hands-on experience before Phase 2.
- Temporal documentation and examples are excellent. Recommended training path documented.
- Start with a simple linear workflow and add complexity incrementally (branching → parallel → signals) rather than building the full Corporate NL template at once.
Fallback: If Temporal proves too complex in Phase 0, evaluate Camunda 8 as an alternative. Camunda's BPMN model may be more intuitive for business-process-oriented developers.
R2: Modular Monolith Coupling
Description: The modular monolith can degrade into a "ball of mud" if module boundaries are not enforced. One module importing from another's internal package is a one-line code change but breaks architectural integrity.
Impact: Extracting modules into services becomes a rewrite rather than a mechanical move. Development velocity slows as changes ripple across modules.
Mitigation:
- ArchUnit tests in CI from Phase 0, Day 1. Any cross-module internal import fails the build.
- Code review checklist includes: "Does this PR cross bounded context boundaries?"
- Shared kernel (shared/) kept minimal. Reviewed quarterly for bloat.
- Module interface contracts enforced at the type level (interface in shared.contract, implementation in domain module).
Leading indicator: ArchUnit violation count. If > 0, stop and fix immediately.
R3: External Integrations Unstable
Description: Sanctions list providers, PEP databases, and corporate registries may be unreliable, slow, or change their API without notice.
Impact: Screening and identity validation become unreliable. Workflows stall.
Mitigation:
- Adapter + fallback pattern: every external integration has a mock provider for development and a fallback wrapper for production.
- Graceful degradation (NFR-R02): if an external service is unavailable, mark the task BLOCKED, notify the operator, and retry when the service is available again.
- Provider selection: evaluate multiple commercial sanctions/PEP providers before committing. Negotiate an SLA with the chosen provider.
- List data is cached with a TTL (24h default, configurable per provider). Screening runs against cached data if the provider is temporarily unavailable.
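The TTL cache with fallback to cached data could look roughly like the sketch below. All names here (`CachedListProvider`, `ProviderUnavailable`, the `fetch` callable) are hypothetical and stand in for whatever the real provider adapter exposes; the clock is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, List, Optional, Tuple


class ProviderUnavailable(Exception):
    """Raised by a (hypothetical) provider adapter when the upstream is down."""


class CachedListProvider:
    """Serves list data from cache; refreshes once the TTL has expired."""

    def __init__(self, fetch: Callable[[], List[str]],
                 ttl_seconds: float = 24 * 3600,  # 24h default from the register
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Optional[List[str]] = None
        self._fetched_at = 0.0

    def get(self) -> Tuple[List[str], bool]:
        """Return (entries, is_stale)."""
        now = self._clock()
        if self._data is not None and now - self._fetched_at < self._ttl:
            return self._data, False
        try:
            self._data = self._fetch()
            self._fetched_at = now
            return self._data, False
        except ProviderUnavailable:
            if self._data is None:
                # Nothing cached yet: the caller would mark the task BLOCKED.
                raise
            # Provider is down: fall back to the expired copy and flag it.
            return self._data, True
```

Returning the `is_stale` flag lets the caller record that a screening ran against expired list data, which is likely to matter for the audit trail.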
R4: Configuration Engine Complexity
Description: The config engine is the critical path dependency (PRD §6.12). It must be flexible enough for all domains but simple enough to build in Phase 1 (weeks 3-4, same as initial project standup).
Impact: Delays to config engine delay everything downstream.
Mitigation:
- MVP config scope: workflow templates + thresholds + document rules. NOT approval matrices, routing rules, or custom fields; those are Phase 1.5.
- JSONB for config storage: no schema migrations for rule changes.
- Start with a single config version (ACTIVE only). Add the DRAFT/TEST/SUPERSEDED promotion pipeline in Phase 1.5 after the core engine works.
- Config Admin UI: start with a JSON editor (quick). Add a form-based UI in Phase 1.5.
R5: Corporate Onboarding Scope Creep
Description: The corporate onboarding use case is the proving case but also the most complex (multi-level ownership, PEP exposure, sanctions near-hits, EDD branching). Attempting to build the full complexity in Phase 2 risks never shipping.
Impact: Phase 2 balloons from 4 weeks to 8+ weeks.
Mitigation:
- Start with the Retail Individual flow in Phase 2 (simplest path: no ownership, no EDD). This proves the workflow engine works.
- Add the Corporate flow incrementally: basic ownership → UBO identification → EDD branching → multi-jurisdiction.
- Time-box EDD features: deep due diligence is v2. MVP EDD = analyst writes report in case notes, reviewer approves.
- Product Owner gates scope. Any new requirement must displace something of equal size.
R6: FEC Domain Knowledge Gap
Description: Engineers may not understand KYC/CDD concepts (UBO, PEP, EDD, sanctions adjudication). Misunderstanding leads to incorrect implementations.
Impact: Compliance features implemented incorrectly. Regulatory findings.
Mitigation:
- Client documents provide domain context. All engineers read the Business Concept document and Onboarding Specification before Phase 2.
- Domain glossary maintained in docs. All entities and processes defined with real-world meaning.
- Compliance reviewer (client or external) validates domain correctness at each phase exit.
- Domain events and state machines named in business language (not technical jargon).
R7: PostgreSQL Slow at Scale
Description: Recursive CTEs for ownership traversal may become slow with deep chains (> 10 levels) or large entity graphs.
Impact: UBO identification > 2 seconds. Analysts wait for graph visualization.
Mitigation:
- Performance threshold defined: UBO traversal must complete in < 1 second. Test with 10-level ownership chains in Phase 3c.
- Index on ownership_relationship(child_entity_id) — already in data architecture.
- Materialized path approach as Plan B: store ancestor_chain as TEXT[] for O(1) ancestor lookups.
- Neo4j as Plan C: when graph queries exceed PostgreSQL capabilities, migrate to Neo4j with sync from PostgreSQL.
Leading indicator: P95 latency of GET /api/v1/network-analysis/graph. Alert if > 1s.
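The recursive traversal this risk concerns, bounded by a depth guard, can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so it runs without a server. The `parent_entity_id` column, the sample entities, and the use of 10 levels as a hard cap are assumptions for illustration. The `WITH RECURSIVE` query has the same shape in PostgreSQL, although placeholder syntax depends on the driver.

```python
import sqlite3

MAX_DEPTH = 10  # assumed cap, mirroring the 10-level chains in the test plan

ANCESTORS_SQL = """
WITH RECURSIVE ancestors(entity_id, depth) AS (
    SELECT parent_entity_id, 1
    FROM ownership_relationship
    WHERE child_entity_id = :start
  UNION
    SELECT o.parent_entity_id, a.depth + 1
    FROM ownership_relationship AS o
    JOIN ancestors AS a ON o.child_entity_id = a.entity_id
    WHERE a.depth < :max_depth   -- depth guard: also terminates on cycles
)
SELECT entity_id, MIN(depth) AS min_depth
FROM ancestors
GROUP BY entity_id
ORDER BY min_depth, entity_id
"""


def ancestors(conn, start, max_depth=MAX_DEPTH):
    """Return (entity_id, shortest distance) for every owner above `start`."""
    params = {"start": start, "max_depth": max_depth}
    return conn.execute(ANCESTORS_SQL, params).fetchall()


def demo_connection():
    """Build an in-memory table with a small, made-up ownership graph."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ownership_relationship ("
        "child_entity_id TEXT NOT NULL, parent_entity_id TEXT NOT NULL)"
    )
    # Same index the register lists as already in the data architecture.
    conn.execute(
        "CREATE INDEX idx_ownership_child "
        "ON ownership_relationship(child_entity_id)"
    )
    conn.executemany(
        "INSERT INTO ownership_relationship VALUES (?, ?)",
        [
            ("acme_bv", "holding_a"),
            ("acme_bv", "holding_b"),
            ("holding_a", "person_1"),
            ("holding_b", "holding_a"),
        ],
    )
    return conn


if __name__ == "__main__":
    for entity_id, depth in ancestors(demo_connection(), "acme_bv"):
        print(entity_id, depth)
```

Without the depth guard, a circular ownership structure would keep generating rows at ever greater depths, so some bound (or explicit cycle detection) is needed regardless of which database runs the query.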
R8: Performance Targets Not Met
Description: NFRs define specific targets (2s UI, 5s screening, 100 concurrent workflows). These may not be achievable in Phase 2 with all modules running.
Impact: Poor analyst UX. Slow screening blocks onboarding.
Mitigation:
- Performance testing in Phase 6 (not earlier, to avoid premature optimization).
- Screening: batch subjects (customer + UBOs) into a single provider call. Cache watchlist data locally.
- Risk Rating: pre-compute on data change, not on every access.
- UI: paginate the case list. Load workspace panels on demand (lazy-load the network graph, load screening results when the tab is clicked).
- If performance targets are missed: profile, identify the bottleneck, address it. Do not lower targets.
R9: Regulatory Requirements Change
Description: AML/CTF regulations, sanctions lists, or PEP definitions may change during the 20-week build.
Impact: Already-built features need modification. Audit model may need new fields.
Mitigation:
- Configuration-driven rules (not hardcoded). Regulatory changes that affect thresholds or lists are config changes, not code changes.
- Audit event uses JSONB payload — new fields can be added without schema migration.
- Jurisdiction-specific logic abstracted behind JurisdictionRules interface. Swap implementations per country.
- Design Phase 0-2 with EU/Netherlands rules as starting point. Generalize in Phase 3+.
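One way to picture the `JurisdictionRules` abstraction combined with configuration-driven thresholds is the sketch below. The register names the interface but not its methods, so `is_ubo`, `ConfiguredRules`, `rules_for`, and the config keys are all hypothetical. The percentages are placeholders for illustration, not legal thresholds.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class JurisdictionRules(Protocol):
    """Hypothetical shape of the interface; real methods may differ."""

    def is_ubo(self, ownership_pct: float) -> bool: ...


@dataclass(frozen=True)
class ConfiguredRules:
    """Rules backed by config values, so a threshold change needs no code change."""
    country: str
    ubo_threshold_pct: float

    def is_ubo(self, ownership_pct: float) -> bool:
        return ownership_pct > self.ubo_threshold_pct


# Placeholder values for illustration only; "XX" is a made-up jurisdiction.
CONFIG: Dict[str, Dict[str, float]] = {
    "NL": {"ubo_threshold_pct": 25.0},
    "XX": {"ubo_threshold_pct": 10.0},
}


def rules_for(country: str, config: Dict[str, Dict[str, float]] = CONFIG) -> JurisdictionRules:
    """Look up the rule set for a country; raises KeyError if none is configured."""
    entry = config[country]
    return ConfiguredRules(country, entry["ubo_threshold_pct"])
```

Callers depend only on `JurisdictionRules`, so adding a country becomes a config entry (or a new implementation if the logic itself differs) rather than a change to the calling code.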
R10: Key Person Dependency
Description: Kotlin/Temporal expertise may be concentrated in one or two engineers. If they leave or are unavailable, velocity drops.
Impact: Phases dependent on that expertise stall.
Mitigation:
- Pair programming for all Temporal workflows (no solo Temporal work in Phase 2).
- Kotlin: hire for Java experience. Kotlin is a gentle step from Java, with roughly a 2-week ramp-up.
- All design decisions documented (architecture docs serve as onboarding material).
- CI enforces code standards, so there are no "expert-only" code paths.
R11: Audit Event Volume
Description: The target is 1,000 audit events/second (NFR-P04). A single onboarding workflow with parallel tasks can generate 50+ events in seconds. If 100 workflows run concurrently, that is 100 × 50 = 5,000+ events in a burst.
Impact: PostgreSQL write bottleneck. Audit write can become the slowest component.
Mitigation:
- Audit events are append-only (no UPDATE, no DELETE), so writes are fast.
- Batch writes: accumulate events in memory (100ms buffer), flush in batches of 50-100.
- Partition by month from day one. Queries against recent partitions are fast; old partitions are rarely accessed.
- If PostgreSQL becomes the bottleneck: offload audit writes to Kafka, with an async consumer writing to PostgreSQL.
Leading indicator: P99 write latency of POST /api/v1/audit/events. Alert if > 50ms.
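The batch-write mitigation (100ms buffer, batches of up to 100) might be sketched like this. `AuditBatcher` and the `sink` callable are hypothetical names. The sketch is single-threaded and leaves out the timer that would call `flush_if_due` periodically in a real service.

```python
import time
from typing import Any, Callable, List, Optional


class AuditBatcher:
    """Buffers events and hands them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[Any]], None],
                 max_batch: int = 100,
                 max_wait_s: float = 0.1,  # the 100ms buffer from the register
                 clock: Callable[[], float] = time.monotonic):
        self._sink = sink
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._oldest: Optional[float] = None  # arrival time of first buffered event

    def add(self, event: Any) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(event)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        """Flush when the batch is full or the oldest event has waited too long."""
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_batch
        aged = self._clock() - self._oldest >= self._max_wait_s
        if full or aged:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)  # e.g. one multi-row INSERT per batch
```

One trade-off to weigh: events held in memory are lost if the process dies before a flush, which may or may not be acceptable for an audit trail.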
R12: Security Vulnerability
Description: A vulnerability in Spring Boot, React, PostgreSQL, or Temporal could be discovered during the build. Zero-days are particularly dangerous because no patch exists at the time of disclosure.
Impact: Platform may need emergency patching. Compliance risk if vulnerability exploited before patch.
Mitigation:
- Dependency scanning in CI (OWASP Dependency Check, Snyk). Alert on critical/high CVEs.
- Regular upgrades: Spring Boot and Temporal minor versions updated within 1 week of release.
- Penetration test in Phase 6 before go-live.
- Rate limiting, correlation IDs, and audit logging from day one, to detect suspicious activity early.
Risk Monitoring
| Metric | Frequency | Owner |
|---|---|---|
| ArchUnit violations | Every build (CI) | Tech Lead |
| P95 API latency | Weekly (Phase 3+) | Performance Lead |
| Audit event write latency | Weekly (Phase 5+) | Data Architect |
| Dependency CVEs | Weekly (CI) | Security Lead |
| Phase milestone slippage | Daily standup | Engineering Manager |
| Scope change count | Sprint review | Product Owner |
Risk register validated against PRD v1.0 scope and all domain specs. Re-evaluate at each phase exit.