Monitoring & Operations Policy
Bridgit Platform (askbridgit.ca)
Version 1.0 | Effective: April 29, 2026 | Next Review: October 29, 2026
1. Purpose and Scope
This policy defines the operational procedures, monitoring practices, backup and recovery processes, and security event detection mechanisms for the Bridgit platform.
Applies to: All personnel with access to Bridgit production systems, source code, or cloud infrastructure.
In scope: Cloud Run services, Cloud SQL database, Redis, GCS storage, GitHub CI/CD pipeline, third-party API integrations, and all application-level logging and monitoring.
Compliance mapping: ISO 27001 A.12, A.13; SOC 2 CC2.1, CC4, CC5.2, CC7.
2. Operational Procedures
Documented Procedures
Operating procedures are maintained in the git repository:
- Deployment: /docs/deployment/DEPLOYMENT_GUIDE.md (staging, production, migrations, rollback, recovery)
- Development standards: CLAUDE.md and /docs/development/ (API, frontend, database, naming conventions)
- Infrastructure: docker-compose.yml defines the full service stack
Documentation is updated when code changes affect procedures. A full documentation audit was conducted in February 2026.
Change Management
Three deployment modes via GitHub Actions:
- code_only: features, bug fixes, UI changes. No database impact.
- schema_change_local_data: test migrations with development data on staging.
- schema_change_production_clone: test migrations against production data on staging.
All deployments require:
- Automated tests passing (npm run test:quick)
- Linting passing (npm run lint)
- Database backup before production push (scripts/prepare-deployment.sh)
- Staging validation before production merge
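The gate above can be sketched as a single script. This is illustrative only; the authoritative implementation is the GitHub Actions workflow together with scripts/prepare-deployment.sh, and the sketch only shows the required ordering of checks.

```shell
#!/usr/bin/env bash
# Illustrative pre-deployment gate (sketch, not the actual pipeline).
set -euo pipefail

npm run lint            # linting must pass
npm run test:quick      # automated tests must pass

# Database backup before any production push
./scripts/prepare-deployment.sh

echo "Pre-deployment checks passed; proceed with staging validation."
```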
Emergency changes may be deployed directly, with a post-deployment review within 48 hours.
Capacity Management
Cloud Run provides automatic scaling. Cloud SQL connection pooling manages database capacity. AI provider usage monitored via ai_usage_logs. Capacity reviewed when performance issues are reported.
3. Malware Protection
Bridgit is cloud-native with no managed endpoints.
Application level: file upload validation (type and size constraints), npm dependency scanning (npm audit, Dependabot), no executable uploads permitted.
Infrastructure level: Cloud Run containers rebuilt from Dockerfiles on each deployment (ephemeral, stateless). Cloud SQL and GCS managed by Google with provider-level security patching.
Development: developer machines rely on OS-level protections (macOS Gatekeeper, XProtect).
Users report suspicious behavior via the Report a Problem form with security triage fields.
4. Backup and Recovery
Backup Policy
Cloud SQL database:
- Automated daily backups by GCP (7-day retention)
- Manual pg_dump before every production deployment (scripts/prepare-deployment.sh)
Source code: git repository with full history on GitHub.
File uploads: GCS regional redundancy (no separate backup required).
Secrets: GCP Secret Manager maintains secret versions.
Redis: append-only file persistence. Session data is transient.
Gap: local pg_dump backups are not encrypted at rest.
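One possible way to close the encryption gap noted above, assuming gpg is available on the machine running the backup (the connection string, passphrase file, and output path are placeholders):

```shell
# Possible remediation sketch for the at-rest encryption gap.
# DATABASE_URL, the passphrase file, and output names are placeholders.
pg_dump "$DATABASE_URL" \
  | gzip \
  | gpg --batch --symmetric --cipher-algo AES256 \
        --passphrase-file /secure/backup-passphrase \
        -o "backup-$(date +%Y%m%d).sql.gz.gpg"

# Decrypt and restore later:
# gpg --batch --decrypt --passphrase-file /secure/backup-passphrase \
#     backup-YYYYMMDD.sql.gz.gpg | gunzip | psql "$DATABASE_URL"
```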
Backup Schedule
- Cloud SQL: daily automated (full), 7-day retention
- pg_dump: per deployment (full), retained locally
- GCS: continuous redundancy, indefinite
Restoration
1. Identify the appropriate backup (Cloud SQL automated or manual pg_dump)
2. Restore via the GCP console or psql import
3. Verify key tables (users, activity_instances, form_schemas)
4. Restart application services
5. Confirm functionality via smoke test
Backup restoration tested semi-annually.
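The restoration steps above can be sketched as a runbook fragment. Instance, backup, and file names below are placeholders, not the production values:

```shell
# Illustrative restore runbook (placeholder names throughout).

# Option A: restore a GCP automated backup to the Cloud SQL instance
gcloud sql backups list --instance=bridgit-prod
gcloud sql backups restore BACKUP_ID --restore-instance=bridgit-prod

# Option B: import a manual pg_dump
psql "$DATABASE_URL" < backup-YYYYMMDD.sql

# Verify key tables before declaring the restore successful
psql "$DATABASE_URL" -c "SELECT count(*) FROM users;"
psql "$DATABASE_URL" -c "SELECT count(*) FROM activity_instances;"
psql "$DATABASE_URL" -c "SELECT count(*) FROM form_schemas;"
```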
5. Logging and Monitoring
Events Logged
- Authentication events (login, logout, failures)
- Authorization events (access granted/denied via RBAC middleware)
- Administrative actions
- System events (startup, shutdown, errors via Cloud Run)
- Data access and modifications (activity_instances audit trail)
- Application events (API requests, AI provider calls)
Log Retention
- Cloud Run application logs: 30 days (GCP default)
- Application audit trail (activity_instances): indefinite (database)
- ai_usage_logs: indefinite (database)
- Git history: indefinite
- GitHub Actions logs: 90 days
Gap: Cloud Run logs should be exported to Cloud Storage or BigQuery for longer retention.
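Closing this gap could look like a log sink to Cloud Storage; the bucket name and filter below are placeholders for illustration:

```shell
# Possible remediation: export Cloud Run logs to a GCS bucket for
# long-term retention (bucket name and filter are placeholders).
gcloud logging sinks create bridgit-cloud-run-archive \
  storage.googleapis.com/bridgit-log-archive \
  --log-filter='resource.type="cloud_run_revision"'
# Then grant the sink's writer identity objectCreator on the bucket and
# set a bucket lifecycle rule matching the desired retention period.
```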
Log Review
- Real-time: Cloud Run error logs monitored during and after deployments
- On-demand: logs reviewed when incidents are reported
- Periodic: Cloud Run metrics checked during deployment cycles
- Escalation: anomalies triaged per Incident Response Policy (P1-P4)
Gap: formal scheduled log review and automated error rate alerting to be implemented.
6. Network Security
Network Controls
- HTTPS enforced on all public endpoints (TLS managed by GCP)
- Cloud Run: no direct network access, all traffic via GCP load balancer
- Cloud SQL: authorized networks, SSL required in production, no public IP
- Redis: internal container network only
- GCS: IAM-controlled, signed URLs for time-limited access
- API: CORS restricted to production frontend origin, JWT validation on all authenticated routes
Segmentation
Cloud-native architecture provides isolation:
- Cloud Run containers have no inbound access beyond HTTPS
- Cloud SQL accessible only from authorized networks
- Redis not publicly accessible
- Secret Manager accessible only via IAM-authenticated API calls
Monitoring
- Cloud Run request logs (volume, latency, error rates)
- Cloud SQL connection metrics
- GCP infrastructure-level DDoS protection for Cloud Run
7. Security Event Detection
Detection Mechanisms
Application-level: authentication failure logging, RBAC unauthorized access logging, API error logging, Report a Problem form with security section.
Infrastructure-level: Cloud Run error rates, Cloud SQL audit logs, GCP IAM audit logs, GitHub audit log.
Dependency-level: npm audit, GitHub Dependabot alerts.
AI monitoring: ai_usage_logs tracks all provider calls for anomalous patterns.
No formal SIEM deployed.
Alerting and Escalation
Follows Incident Response Policy severity levels:
- P1: immediate response (service down, active breach)
- P2: within 1 hour (confirmed unauthorized access, error spike)
- P3: within 4 hours (suspicious activity)
- P4: logged, reviewed at next cycle
Security-flagged Problem Reports reviewed by Platform Administrator.
Gap: automated threshold alerting to be configured.
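A first step toward this could be a log-based metric on Cloud Run 5xx responses with an alerting policy attached; the metric name and thresholds below are placeholders:

```shell
# Sketch of automated error-rate alerting (names/thresholds are placeholders).
# 1. Create a log-based metric counting Cloud Run 5xx responses:
gcloud logging metrics create bridgit_5xx_count \
  --description="Cloud Run 5xx responses" \
  --log-filter='resource.type="cloud_run_revision" AND httpRequest.status>=500'
# 2. Attach an alerting policy to the metric (via the Cloud Monitoring
#    console or API), e.g. more than N events in 5 minutes notifies the
#    Platform Administrator, mapped to the P2 threshold above.
```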
8. Anomaly Monitoring
Baselines
Established through operational experience:
- Normal API response times and request volumes
- Typical AI provider usage patterns (ai_usage_logs)
- Expected Cloud Run instance counts
- Normal user login patterns
Baselines are informal at current scale. Formal statistical baseline collection is a planned improvement.
Anomaly Detection
Rule-based: authentication failure thresholds, RBAC blocking, file upload validation.
Manual: developer observation during deployments, Problem Report investigation, periodic ai_usage_logs review.
Infrastructure: Cloud Run auto-scaling behavior changes, Cloud SQL slow query logs.
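As a concrete illustration of a rule-based check, the following self-contained sketch flags an authentication-failure spike in a log extract. The log format and threshold are invented for the example; real events would come from Cloud Run log queries rather than a local file.

```shell
#!/usr/bin/env bash
# Rule-based anomaly sketch: flag when authentication failures in a log
# window exceed a threshold. Log lines below are fabricated sample data.
set -euo pipefail

THRESHOLD=5
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2026-04-29T10:00:01Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:02Z LOGIN_SUCCESS user=b@example.com
2026-04-29T10:00:03Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:04Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:05Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:06Z LOGIN_FAILURE user=a@example.com
2026-04-29T10:00:07Z LOGIN_FAILURE user=a@example.com
EOF

failures=$(grep -c 'LOGIN_FAILURE' "$LOG")   # counts 6 failure lines
if [ "$failures" -ge "$THRESHOLD" ]; then
  echo "ALERT: $failures authentication failures in window (threshold $THRESHOLD)"
else
  echo "OK: $failures authentication failures in window"
fi
rm -f "$LOG"
```

Anomalies flagged this way are triaged under the Incident Response Policy severity levels described in Section 7.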
Response Triggers
Automated: Cloud Run container restart, JWT blacklist enforcement, RBAC blocking.
Manual: unusual AI usage investigation, authentication failure investigation, cross-org data report triage, deployment rollback on error spike.
9. IT General Controls
Program Change Management: git version control, GitHub Actions CI/CD, three deployment modes, staging validation, mandatory backup, rollback capability.
Access to Programs and Data: GCP IAM, GitHub repository access, application RBAC, JWT with Redis blacklisting, AES-256-GCM encrypted OAuth tokens.
Computer Operations: Cloud Run auto-scaling and health management, Cloud Scheduler for GDPR jobs, automated daily database backups.
Program Development: PRD-driven process, architect/critic/code review for systemic changes, automated testing and linting, E2E tests.
Configuration Management: docker-compose.yml, GitHub Actions workflows, GCP Secret Manager.
10. Information Quality and Internal Controls
Data integrity: PostgreSQL constraints, application-level validation, Sequelize model validation, Survey.js schema validation.
Source reliability: all data from platform database or GCP-managed services.
Timeliness: audit trail recorded at time of action, Cloud Run logs near real-time, AI usage logged synchronously.
Completeness: all API routes pass through middleware (auth, RBAC, billing).
Control issues communicated via: immediate escalation for P1/P2, development backlog for improvements, semi-annual review for gaps.
Critical deficiency escalation: within 24 hours.
11. Policy Administration
- Version: 1.0
- Effective Date: April 29, 2026
- Last Review: April 29, 2026
- Next Review: October 29, 2026
- Owner: Platform Administrator
- Review Frequency: Semi-annually, or after any P1/P2 incident
- Approved By: (signature / name / date)
This policy is maintained alongside the platform source code and is subject to version control. Changes require review and re-approval.