```html

Building a Real-Time GA4 Data Pipeline and Orchestrator-Driven Reporting System

Over the past development session, we implemented a comprehensive Google Analytics 4 data collection audit, built programmatic API access for traffic analysis, and integrated the results into an orchestrator-driven reporting system. This post details the architecture decisions, implementation patterns, and operational changes that enable automated, scalable analytics reporting across all Sail JADA properties.

The Problem: Manual Analytics, Silent Blind Spots

Before this work, analytics reporting was manual—GA4 data lived in dashboards, traffic insights weren't programmatic, and there was no centralized way to detect missing instrumentation across properties. Email campaigns shipped without baseline traffic analysis, and we had no systematic way to recommend optimizations.

Three critical gaps emerged:

  • No GA Data API access: Zero programmatic traffic data available for automation and reporting.
  • Instrumentation unknowns: No audit of which pages across which properties had GA tracking codes.
  • Disconnected workflows: Campaign scheduling, traffic analysis, and recommendations lived in separate silos.

Solution Architecture: Three-Layer System

We built a three-layer system: collection layer (GA code audit), ingestion layer (GA4 Data API), and reporting layer (orchestrator-driven card generation).

Layer 1: GA Code Audit Across All Properties

The first step was knowing what we're actually tracking. We created an audit script that crawls HTML files across all deployed properties and extracts GA tracking codes.

Scope: Three main properties—JADA (Queen of San Diego), QOS, and internal dashboards.

Implementation: Python script that:

  • Walks HTML file trees in /Users/cb/Documents/repos
  • Parses <script> tags for Google Tag Manager (GTM) container IDs and GA4 measurement IDs
  • Extracts property IDs and measurement protocol endpoints
  • Compares found codes against expected GA4 properties from Admin console
  • Generates a report identifying pages with missing or misconfigured tracking
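
The audit steps above can be sketched roughly as follows. This is a minimal illustration, not the actual audit script: the function name audit_html_tree, the ID regexes, and the shape of the returned report are all assumptions, and it scans each whole file instead of parsing individual <script> tags.

```python
import re
from pathlib import Path

# Assumed ID shapes: GA4 measurement IDs look like "G-XXXXXXXXXX",
# GTM container IDs look like "GTM-XXXXXXX".
GA4_ID = re.compile(r"\bG-[A-Z0-9]{6,12}\b")
GTM_ID = re.compile(r"\bGTM-[A-Z0-9]{4,10}\b")


def audit_html_tree(root, expected_ids):
    """Report pages with no tracking ID and pages carrying only unexpected IDs.

    Hypothetical helper: `expected_ids` stands in for the list of IDs
    copied from the Analytics Admin console.
    """
    expected = set(expected_ids)
    missing, unexpected = [], {}
    for page in sorted(Path(root).rglob("*.html")):
        text = page.read_text(encoding="utf-8", errors="ignore")
        found = set(GA4_ID.findall(text)) | set(GTM_ID.findall(text))
        if not found:
            missing.append(str(page))
        elif not found & expected:
            # The page is tagged, but with none of the IDs we expect.
            unexpected[str(page)] = sorted(found)
    return {"missing": missing, "unexpected": unexpected}
```

A call such as audit_html_tree("/Users/cb/Documents/repos", expected) would return two collections: pages with no ID at all, and pages whose IDs do not match any expected property. A plain regex scan can produce false positives on strings that merely resemble an ID, so treat the output as a list of pages to review.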

Key finding: The dashboard at progress.queenofsandiego.com was fully instrumented with hash-based deep linking support (/#card-{id} format), making it trackable for SPA navigation events.

Layer 2: OAuth Service Account Setup for GA4 Data API

To move from manual reporting to programmatic access, we needed authenticated access to Google Analytics 4 Data API v1.

Setup steps:

  • Created a service account in Google Cloud Console for the analytics project
  • Generated a private key credential file (JSON format) stored securely outside the repo
  • Granted the service account "Viewer" role on the GA4 property via Analytics Admin console
  • Installed the Google Analytics Data API Python client: pip install google-analytics-data

Authentication flow: The existing pattern in /Users/cb/Documents/repos/tools/reauth_ga.py uses Google's Application Default Credentials (ADC). The script loads the service account key and exchanges it for short-lived OAuth access tokens, with scope limited to read-only Analytics access (https://www.googleapis.com/auth/analytics.readonly).
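
As a rough illustration of the ADC side of this flow, the sketch below checks that a key file looks like a service account key and then points ADC at it through the GOOGLE_APPLICATION_CREDENTIALS environment variable. It is not the contents of reauth_ga.py; the function name point_adc_at_key is hypothetical, and the key path is whatever location you chose outside the repo.

```python
import json
import os
from pathlib import Path


def point_adc_at_key(key_path):
    """Validate a service account key file and expose it to ADC.

    Hypothetical helper. Returns the service account email so it can be
    compared with the user granted access in Analytics Admin.
    """
    info = json.loads(Path(key_path).read_text(encoding="utf-8"))
    if info.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key")
    for field in ("client_email", "private_key"):
        if not info.get(field):
            raise ValueError(f"{key_path} is missing {field}")
    # Google client libraries that rely on ADC read this variable.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_path)
    return info["client_email"]
```

Checking the file before the first API call turns a confusing authentication error into an immediate, readable failure, which helps in unattended jobs.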

API Query Pattern:

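from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import RunReportRequest

# With no arguments, the client authenticates via Application Default
# Credentials (for example, the key file named in GOOGLE_APPLICATION_CREDENTIALS).
client = BetaAnalyticsDataClient()
property_id = "properties/XXXXXXXXXX"  # Numeric GA4 property ID from GA Admin

# Last 30 days of users and pageviews, one row per page path.
request = RunReportRequest(
    property=property_id,
    date_ranges=[{"start_date": "30daysAgo", "end_date": "today"}],
    metrics=[{"name": "activeUsers"}, {"name": "screenPageViews"}],
    dimensions=[{"name": "pagePath"}],
)
response = client.run_report(request)

# Each row carries values in the same order as the requested dimensions and metrics.
for row in response.rows:
    print(row.dimension_values[0].value, row.metric_values[0].value, row.metric_values[1].value)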

This query pulls the last 30 days of traffic segmented by page path—exactly what we need for baseline metrics and anomaly detection.
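
To turn the response into something a baseline can be computed from, one option is to flatten it into plain dictionaries. The sketch below assumes a response shaped like the Data API's report response (dimension_headers, metric_headers, and rows whose values expose a string .value); the helper names rows_to_dicts and top_pages are hypothetical.

```python
def rows_to_dicts(response):
    """Flatten a report response into one dict per row.

    Works on any object with the same attribute layout, which keeps the
    function easy to exercise without calling the API.
    """
    dim_names = [h.name for h in response.dimension_headers]
    metric_names = [h.name for h in response.metric_headers]
    records = []
    for row in response.rows:
        record = {n: v.value for n, v in zip(dim_names, row.dimension_values)}
        # Metric values arrive as strings; activeUsers and screenPageViews
        # are integer counts, so int() is safe for this particular query.
        record.update(
            {n: int(v.value) for n, v in zip(metric_names, row.metric_values)}
        )
        records.append(record)
    return records


def top_pages(records, metric="screenPageViews", limit=10):
    """Return the `limit` records with the highest value for `metric`."""
    return sorted(records, key=lambda r: r[metric], reverse=True)[:limit]
```

Ratio metrics would need float() in place of int(), so this conversion is specific to the two count metrics requested above.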

Layer 3: Orchestrator-Driven Report Generation

Raw data isn't useful on its own. We built an orchestrator agent that consumes GA4 traffic data, inspects email campaign schedules, and generates actionable kanban cards on the dashboard.

Orchestrator brief includes:

  • GA4 traffic for the last 30 days (pageviews, active users, bounce rate by page)
  • Instrumentation audit results (which pages are missing GA codes)
  • Constant Contact campaign schedule (upcoming blasts with status)
  • Current campaign dedup logs (who's already received what)

Output: The orchestrator generates a live dashboard card (ID: t-31aa2593, accessible at https://progress.queenofsandiego.com/#card-t-31aa2593) with five structured sections:

  • Traffic Baseline: Top 10 pages by pageviews, traffic trends, bounce rate hotspots
  • Instrumentation Gaps: List of pages missing GA codes by property
  • Campaign Status: Upcoming email blasts with approval status and send readiness
  • Traffic Recommendations: Pages with high bounce rates needing optimization, content gaps
  • Operational Excellence: Dedup log analysis, unsubscribe trends, delivery failures
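
One way to picture the mapping from brief to card is a single function that assembles the five sections. Everything here is a sketch under assumptions: the brief keys (traffic, missing_ga_pages, campaigns, dedup_summary), the card layout, and the 70% bounce threshold are illustrative and are not the orchestrator's actual schema.

```python
def build_card(brief, bounce_threshold=0.7):
    """Assemble the five card sections from an orchestrator brief.

    Assumes brief["traffic"] is a list of dicts with "pagePath",
    "screenPageViews", and "bounceRate" (a fraction between 0 and 1).
    """
    pages = brief["traffic"]
    top = sorted(pages, key=lambda p: p["screenPageViews"], reverse=True)[:10]
    hotspots = [p["pagePath"] for p in pages if p["bounceRate"] >= bounce_threshold]
    return {
        "Traffic Baseline": top,
        "Instrumentation Gaps": brief["missing_ga_pages"],
        "Campaign Status": brief["campaigns"],
        "Traffic Recommendations": [
            f"Review {path}: bounce rate at or above {bounce_threshold:.0%}"
            for path in hotspots
        ],
        "Operational Excellence": brief["dedup_summary"],
    }
```

Keeping the section names as dictionary keys means the renderer only needs to iterate over the card, so adding a sixth section later would not require touching the display code.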

Infrastructure and Storage

Campaign logs location: Dedup and campaign tracking logs live in an S3 bucket. The blast script in /Users/cb/Documents/repos/tools reads and writes these logs, which record which contacts received which campaigns, preventing duplicate sends.

CSV contact sources: Constant Contact exports land as CSVs in a known location. The blast script reads from these CSVs, cross-references against the campaign dedup log, and prepares send payloads.
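
The cross-referencing step can be sketched as below. This is not the blast script itself: the function name pending_recipients and the "email" column header are assumptions (check the header in your actual export), and the dedup log is passed in as an in-memory collection so the S3 read is left out.

```python
import csv


def pending_recipients(contacts_csv, already_sent, email_column="email"):
    """Return addresses from the CSV that have not yet received the campaign.

    `already_sent` is any iterable of addresses taken from the dedup log.
    Addresses are compared case-insensitively, and duplicates within the
    CSV itself are dropped as well.
    """
    seen = {address.strip().lower() for address in already_sent}
    pending = []
    with open(contacts_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            email = (row.get(email_column) or "").strip().lower()
            if email and email not in seen:
                seen.add(email)
                pending.append(email)
    return pending
```

Normalizing case and whitespace before comparing matters here, because the same contact can appear with different capitalization across exports.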

Dashboard infrastructure: The progress dashboard is a static SPA served via CloudFront with origin at the dashboard S3 bucket. Hash-based routing enables deep linking to specific cards without page reloads—critical for sharing findings with the team.

Key Technical Decisions

Why service account OAuth instead of user credentials? Service accounts authenticate programmatically without requiring interactive login. They scale for automated jobs and can be audited via Google Cloud IAM. User credentials require periodic re-authentication and are fragile in CI/CD contexts.

Why GA Data API v1 instead of Reporting API v4? The Reporting API v4 only serves Universal Analytics properties and cannot query GA4 data at all. The Data API v1 is the API built for GA4 properties, with filtering and dimensional breakdowns that match the GA4 event schema.

Why orchestrator-driven card generation? Centralizing report generation in an orchestrator agent keeps logic DRY, allows for multi-step workflows (pull GA data → check campaigns → generate recommendations), and enables easy updates when business logic changes.

Why hash-based deep linking on the dashboard? Hash routing works in static SPAs without server-side routes. It allows deep linking to specific cards without page refreshes, enabling team members to share findings via URL.
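
For scripts that post card links (for example, into a team message), building and parsing these URLs takes only a few lines. The sketch below follows the /#card-{id} format described earlier; the helper names card_url and card_id_from_url are hypothetical.

```python
from urllib.parse import urlsplit

DASHBOARD = "https://progress.queenofsandiego.com/"
CARD_PREFIX = "card-"


def card_url(card_id):
    """Build a shareable deep link for a dashboard card."""
    return f"{DASHBOARD}#{CARD_PREFIX}{card_id}"


def card_id_from_url(url):
    """Extract the card ID from a deep link, or None if it has no card fragment."""
    fragment = urlsplit(url).fragment
    if fragment.startswith(CARD_PREFIX):
        return fragment[len(CARD_PREFIX):]
    return None
```

Because browsers do not send the fragment to the server, the same static file is served from CloudFront for every card link, and the SPA resolves the card client-side.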

What's Next

Three immediate items