Monitoring & Operations Policy
Bridgit Platform (askbridgit.ca)
Version 1.0 | Effective: April 29, 2026 | Next Review: October 29, 2026
1. Purpose and Scope
This policy defines the operational procedures, monitoring practices, backup and recovery processes, and security event detection mechanisms for the Bridgit platform.
Applies to: All personnel with access to Bridgit production systems, source code, or cloud infrastructure.
In scope: Cloud Run services, Cloud SQL database, Redis, GCS storage, GitHub CI/CD pipeline, third-party API integrations, and all application-level logging and monitoring.
Compliance mapping: ISO 27001 A.12, A.13; SOC 2 CC2.1, CC4, CC5.2, CC7.
2. Operational Procedures
Documented Procedures
Operating procedures are maintained in the git repository:
- Deployment: /docs/deployment/DEPLOYMENT_GUIDE.md (staging, production, migrations, rollback, recovery)
- Development standards: CLAUDE.md and /docs/development/ (API, frontend, database, naming conventions)
- Infrastructure: docker-compose.yml defines the full service stack
Documentation is updated when code changes affect procedures. A full documentation audit was conducted in February 2026.
Change Management
Three deployment modes via GitHub Actions:
- code_only: features, bug fixes, UI changes. No database impact.
- schema_change_local_data: test migrations with development data on staging.
- schema_change_production_clone: test migrations against production data on staging.
All deployments require:
- Automated tests passing (npm run test:quick)
- Linting passing (npm run lint)
- Database backup before production push (scripts/prepare-deployment.sh)
- Staging validation before production merge
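The gate above can be sketched as a single script. This is illustrative only; the authoritative implementation is the GitHub Actions workflow together with scripts/prepare-deployment.sh, and the sketch only shows the required ordering of checks.

```shell
#!/usr/bin/env bash
# Illustrative pre-deployment gate (sketch, not the actual pipeline).
set -euo pipefail

npm run lint            # linting must pass
npm run test:quick      # automated tests must pass

# Database backup before any production push
./scripts/prepare-deployment.sh

echo "Pre-deployment checks passed; proceed with staging validation."
```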
Emergency changes may be deployed directly, with a post-deployment review within 48 hours.
Capacity Management
Cloud Run provides automatic scaling. Cloud SQL connection pooling manages database capacity. AI provider usage monitored via ai_usage_logs. Capacity reviewed when performance issues are reported.
3. Malware Protection
Bridgit is cloud-native with no managed endpoints.
Application level: file upload validation (type and size constraints), npm dependency scanning (npm audit, Dependabot), no executable uploads permitted.
Infrastructure level: Cloud Run containers rebuilt from Dockerfiles on each deployment (ephemeral, stateless). Cloud SQL and GCS managed by Google with provider-level security patching.
Development: developer machines rely on OS-level protections (macOS Gatekeeper, XProtect).
Users report suspicious behavior via the Report a Problem form with security triage fields.
4. Backup and Recovery
Backup Policy
Cloud SQL database:
- Automated daily backups by GCP (7-day retention)
- Manual pg_dump before every production deployment (scripts/prepare-deployment.sh)
Source code: git repository with full history on GitHub.
File uploads: GCS regional redundancy (no separate backup required).
Secrets: GCP Secret Manager maintains secret versions.
Redis: append-only file persistence. Session data is transient.
Gap: local pg_dump backups are not encrypted at rest.
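One possible way to close the encryption gap noted above, assuming gpg is available on the machine running the backup (the connection string, passphrase file, and output path are placeholders):

```shell
# Possible remediation sketch for the at-rest encryption gap.
# DATABASE_URL, the passphrase file, and output names are placeholders.
pg_dump "$DATABASE_URL" \
  | gzip \
  | gpg --batch --symmetric --cipher-algo AES256 \
        --passphrase-file /secure/backup-passphrase \
        -o "backup-$(date +%Y%m%d).sql.gz.gpg"

# Decrypt and restore later:
# gpg --batch --decrypt --passphrase-file /secure/backup-passphrase \
#     backup-YYYYMMDD.sql.gz.gpg | gunzip | psql "$DATABASE_URL"
```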
Backup Schedule
- Cloud SQL: daily automated (full), 7-day retention
- pg_dump: per deployment (full), retained locally
- GCS: continuous redundancy, indefinite
Restoration
1. Identify the appropriate backup (Cloud SQL automated or manual pg_dump)
2. Restore via the GCP console or psql import
3. Verify key tables (users, activity_instances, form_schemas)
4. Restart application services
5. Confirm functionality via smoke test
Backup restoration tested semi-annually.
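The restoration steps above can be sketched as a runbook fragment. Instance, backup, and file names below are placeholders, not the production values:

```shell
# Illustrative restore runbook (placeholder names throughout).

# Option A: restore a GCP automated backup to the Cloud SQL instance
gcloud sql backups list --instance=bridgit-prod
gcloud sql backups restore BACKUP_ID --restore-instance=bridgit-prod

# Option B: import a manual pg_dump
psql "$DATABASE_URL" < backup-YYYYMMDD.sql

# Verify key tables before declaring the restore successful
psql "$DATABASE_URL" -c "SELECT count(*) FROM users;"
psql "$DATABASE_URL" -c "SELECT count(*) FROM activity_instances;"
psql "$DATABASE_URL" -c "SELECT count(*) FROM form_schemas;"
```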
5. Logging and Monitoring
Events Logged
- Authentication events (login, logout, failures)
- Authorization events (access granted/denied via RBAC middleware)
- Administrative actions
- System events (startup, shutdown, errors via Cloud Run)
- Data access and modifications (activity_instances audit trail)
- Application events (API requests, AI provider calls)
Log Retention
- Cloud Run application logs: 30 days (GCP default)
- Application audit trail (activity_instances): indefinite (database)
- ai_usage_logs: indefinite (database)
- Git history: indefinite
- GitHub Actions logs: 90 days
Gap: Cloud Run logs should be exported to Cloud Storage or BigQuery for longer retention.
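Closing this gap could look like a log sink to Cloud Storage; the bucket name and filter below are placeholders for illustration:

```shell
# Possible remediation: export Cloud Run logs to a GCS bucket for
# long-term retention (bucket name and filter are placeholders).
gcloud logging sinks create bridgit-cloud-run-archive \
  storage.googleapis.com/bridgit-log-archive \
  --log-filter='resource.type="cloud_run_revision"'
# Then grant the sink's writer identity objectCreator on the bucket and
# set a bucket lifecycle rule matching the desired retention period.
```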
Log Review
- Real-time: Cloud Run error logs monitored during and after deployments
- On-demand: logs reviewed when incidents are reported
- Periodic: Cloud Run metrics checked during deployment cycles
- Escalation: anomalies triaged per Incident Response Policy (P1-P4)
Gap: formal scheduled log review and automated error rate alerting to be implemented.
6. Network Security
Network Controls
- HTTPS enforced on all public endpoints (TLS managed by GCP)
- Cloud Run: no direct network access, all traffic via GCP load balancer
- Cloud SQL: authorized networks, SSL required in production, no public IP
- Redis: internal container network only
- GCS: IAM-controlled, signed URLs for time-limited access
- API: CORS restricted to production frontend origin, JWT validation on all authenticated routes
Segmentation
Cloud-native architecture provides isolation:
- Cloud Run containers have no inbound access beyond HTTPS
- Cloud SQL accessible only from authorized networks
- Redis not publicly accessible
- Secret Manager accessible only via IAM-authenticated API calls
Monitoring
- Cloud Run request logs (volume, latency, error rates)
- Cloud SQL connection metrics
- GCP infrastructure-level DDoS protection for Cloud Run
7. Security Event Detection
Detection Mechanisms
Application-level: authentication failure logging, RBAC unauthorized access logging, API error logging, Report a Problem form with security section.
Infrastructure-level: Cloud Run error rates, Cloud SQL audit logs, GCP IAM audit logs, GitHub audit log.
Dependency-level: npm audit, GitHub Dependabot alerts.
AI monitoring: ai_usage_logs tracks all provider calls for anomalous patterns.
No formal SIEM deployed.
Alerting and Escalation
Follows Incident Response Policy severity levels:
- P1: immediate response (service down, active breach)
- P2: within 1 hour (confirmed unauthorized access, error spike)
- P3: within 4 hours (suspicious activity)
- P4: logged, reviewed at next cycle
Security-flagged Problem Reports reviewed by Platform Administrator.
Gap: automated threshold alerting to be configured.
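A first step toward this could be a log-based metric on Cloud Run 5xx responses with an alerting policy attached; the metric name and thresholds below are placeholders:

```shell
# Sketch of automated error-rate alerting (names/thresholds are placeholders).
# 1. Create a log-based metric counting Cloud Run 5xx responses:
gcloud logging metrics create bridgit_5xx_count \
  --description="Cloud Run 5xx responses" \
  --log-filter='resource.type="cloud_run_revision" AND httpRequest.status>=500'
# 2. Attach an alerting policy to the metric (via the Cloud Monitoring
#    console or API), e.g. more than N events in 5 minutes notifies the
#    Platform Administrator, mapped to the P2 threshold above.
```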
8. Anomaly Monitoring
Baselines
Established through operational experience:
- Normal API response times and request volumes
- Typical AI provider usage patterns (ai_usage_logs)
- Expected Cloud Run instance counts
- Normal user login patterns
Baselines are informal at current scale. Formal statistical baseline collection is a planned improvement.
Anomaly Detection
Rule-based: authentication failure thresholds, RBAC blocking, file upload validation.
Manual: developer observation during deployments, Problem Report investigation, periodic ai_usage_logs review.
Infrastructure: Cloud Run auto-scaling behavior changes, Cloud SQL slow query logs.
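As a concrete illustration of a rule-based check, the following self-contained sketch flags an authentication-failure spike in a log extract. The log format and threshold are invented for the example; real events would come from Cloud Run log queries rather than a local file.

```shell
#!/usr/bin/env bash
# Rule-based anomaly sketch: flag when authentication failures in a log
# window exceed a threshold. Log lines below are fabricated sample data.
set -euo pipefail

THRESHOLD=5
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2026-04-29T10:00:01Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:02Z LOGIN_SUCCESS user=b@example.com
2026-04-29T10:00:03Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:04Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:05Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:06Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:07Z LOGIN_FAILURE user=a@example.com
EOF

failures=$(grep -c 'LOGIN_FAILURE' "$LOG")   # counts 6 failure lines
if [ "$failures" -ge "$THRESHOLD" ]; then
  echo "ALERT: $failures authentication failures in window (threshold $THRESHOLD)"
else
  echo "OK: $failures authentication failures in window"
fi
rm -f "$LOG"
```

Anomalies flagged this way are triaged under the Incident Response Policy severity levels described in Section 7.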
Response Triggers
Automated: Cloud Run container restart, JWT blacklist enforcement, RBAC blocking.
Manual: unusual AI usage investigation, authentication failure investigation, cross-org data report triage, deployment rollback on error spike.
9. IT General Controls
Program Change Management: git version control, GitHub Actions CI/CD, three deployment modes, staging validation, mandatory backup, rollback capability.
Access to Programs and Data: GCP IAM, GitHub repository access, application RBAC, JWT with Redis blacklisting, AES-256-GCM encrypted OAuth tokens.
Computer Operations: Cloud Run auto-scaling and health management, Cloud Scheduler for GDPR jobs, automated daily database backups.
Program Development: PRD-driven process, architect/critic/code review for systemic changes, automated testing and linting, E2E tests.
Configuration Management: docker-compose.yml, GitHub Actions workflows, GCP Secret Manager.
10. Information Quality and Internal Controls
Data integrity: PostgreSQL constraints, application-level validation, Sequelize model validation, Survey.js schema validation.
Source reliability: all data from platform database or GCP-managed services.
Timeliness: audit trail recorded at time of action, Cloud Run logs near real-time, AI usage logged synchronously.
Completeness: all API routes pass through middleware (auth, RBAC, billing).
Control issues communicated via: immediate escalation for P1/P2, development backlog for improvements, semi-annual review for gaps.
Critical deficiency escalation: within 24 hours.
11. Policy Administration
- Version: 1.0
- Effective Date: April 29, 2026
- Last Review: April 29, 2026
- Next Review: October 29, 2026
- Owner: Platform Administrator
- Review Frequency: Semi-annually, or after any P1/P2 incident
- Approved By: (signature / name / date)
This policy is maintained alongside the platform source code and is subject to version control. Changes require review and re-approval.