Building an Automated GA4 Audit Pipeline with Dashboard Deep-Linking and Multi-Site Traffic Analysis

What We Built

We implemented a Google Analytics 4 audit system that automatically scans every HTML file across multiple sites, pulls the last 30 days of traffic data via the GA4 Data API, feeds the findings into an orchestrator service, and surfaces results on a real-time kanban dashboard with deep-linkable card references. The system identified gaps in GA tracking code, tied each gap to a specific file path, and generated actionable operational excellence recommendations.

The Problem We Solved

Before this work, we had no programmatic visibility into:

  • Which pages across our multi-site portfolio actually had GA tracking code deployed
  • What GA properties were being used and which ones were orphaned
  • Real traffic patterns over the last month across all platforms
  • Which sites had operational gaps (missing tracking, inconsistent deployments, etc.)
  • A centralized place to surface audit findings that persists beyond console logs

The manual alternative would have required diffing hundreds of HTML files by hand. We needed automation that could run as a background job and surface results on our progress dashboard.

Technical Architecture

Phase 1: HTML GA Code Audit

We built a recursive file scanner that walks the repository tree for all HTML templates across sites:


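# Sketch of the audit sweep (helper body and ID pattern are illustrative)
import os
import re

SCAN_DIRECTORIES = [
    '/Users/cb/Documents/repos/sites/jada/',
    '/Users/cb/Documents/repos/sites/qos/',
    '/Users/cb/Documents/repos/sites/tools/',
]

# Assumption: tracking IDs appear as GA4 measurement IDs such as "G-XXXXXXXXXX"
GA_ID_PATTERN = re.compile(r'\bG-[A-Z0-9]+\b')

def extract_ga_tags(content):
    """Return the unique GA IDs found in one HTML document."""
    return sorted(set(GA_ID_PATTERN.findall(content)))

missing_tracking = []  # paths of files with no GA code
found_ids = {}         # file path -> list of GA IDs found in it

for directory in SCAN_DIRECTORIES:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.html'):
                file_path = os.path.join(root, file)
                with open(file_path, encoding='utf-8', errors='replace') as fh:
                    ga_ids = extract_ga_tags(fh.read())
                if not ga_ids:
                    missing_tracking.append(file_path)
                else:
                    found_ids[file_path] = ga_ids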

The scanner extracts GA property IDs from all <script> blocks containing gtag or ga4 configurations. It logs the exact file path, line number, and property ID for each match, then flags files with zero GA code.
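The extraction step described above could look roughly like the following. This is a minimal sketch, not the scanner's actual code: the class name ScriptGAFinder, the function find_ga_ids, and the "G-" ID pattern are assumptions made for illustration. It only looks inside <script> tags (both the src attribute and the inline body) and reports the line number of each match.

```python
import re
from html.parser import HTMLParser

# Assumption: IDs look like GA4 measurement IDs, e.g. "G-ABC1234567".
GA_ID = re.compile(r"\bG-[A-Z0-9]{4,}\b")


class ScriptGAFinder(HTMLParser):
    """Collects (line_number, ga_id) pairs found in <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        self.in_script = True
        line, _ = self.getpos()  # line where the <script> tag starts
        for name, value in attrs:
            if name == "src" and value:
                for ga_id in GA_ID.findall(value):
                    self.matches.append((line, ga_id))

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            return
        start_line, _ = self.getpos()  # line where this text chunk starts
        for m in GA_ID.finditer(data):
            # Offset the chunk's start line by the newlines before the match.
            line = start_line + data.count("\n", 0, m.start())
            self.matches.append((line, m.group()))


def find_ga_ids(html):
    """Return [(line_number, ga_id), ...] for one HTML document."""
    finder = ScriptGAFinder()
    finder.feed(html)
    finder.close()
    return finder.matches
```

A file for which find_ga_ids returns an empty list is a candidate for the "zero GA code" flag. Restricting the search to script tags avoids false positives from ID-like strings in visible page text.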

Phase 2: GA4 Data API Integration

We created /Users/cb/Documents/repos/tools/reauth_ga.py to handle OAuth2 token refresh and programmatic GA4 API calls. The script:

  • Loads an existing service account credential file (stored securely outside the repo)
  • Authenticates with Google's OAuth2 endpoints
  • Calls the GA4 Data API (v1) with the property ID from our audit
  • Pulls aggregated metrics for the last 30 days: pageviews, sessions, users, bounce rate, engagement rate
  • Returns structured JSON for downstream processing

Command pattern:


python reauth_ga.py \
  --property-id 12345678 \
  --date-range 30 \
  --metrics pageviews,users,sessions \
  --dimensions pagePath,deviceCategory
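As a rough illustration of what such a command might translate into, the sketch below builds the URL and JSON body for the Data API's runReport method. The function name build_run_report_request and the METRIC_ALIASES mapping are hypothetical; the write-up does not say how reauth_ga.py maps its flag values onto API metric names, so the mapping shown is an assumption.

```python
API_ROOT = "https://analyticsdata.googleapis.com/v1beta"

# Assumption: the CLI accepts friendly names and maps them to Data API metrics.
METRIC_ALIASES = {
    "pageviews": "screenPageViews",
    "users": "totalUsers",
    "sessions": "sessions",
}


def build_run_report_request(property_id, days, metrics, dimensions):
    """Return (url, body) for a runReport call covering the last `days` days."""
    url = f"{API_ROOT}/properties/{property_id}:runReport"
    body = {
        # "NdaysAgo" through "yesterday" covers N complete days.
        "dateRanges": [{"startDate": f"{days}daysAgo", "endDate": "yesterday"}],
        "metrics": [{"name": METRIC_ALIASES.get(m, m)} for m in metrics],
        "dimensions": [{"name": d} for d in dimensions],
    }
    return url, body
```

The body would be sent as a POST with an "Authorization: Bearer <access token>" header. Keeping request construction separate from the HTTP call makes the mapping easy to check without network access.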

The script handles token expiration gracefully by checking the token's expires_in field and re-authenticating only when needed, reducing API call overhead.
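The expiry check can be sketched as a small cache around whatever function actually fetches the token. The TokenCache class and its skew_seconds safety margin are assumptions for illustration; only the use of the expires_in field comes from the description above.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch_token, clock=time.time, skew_seconds=60):
        # fetch_token: callable returning {"access_token": ..., "expires_in": ...}
        self._fetch_token = fetch_token
        self._clock = clock
        self._skew = skew_seconds  # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            response = self._fetch_token()
            self._token = response["access_token"]
            self._expires_at = now + float(response["expires_in"])
        return self._token
```

Injecting the clock and the fetch function keeps the refresh logic testable without calling Google's token endpoint.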

Phase 3: Orchestrator Integration

The orchestrator service (running as a background agent) receives the audit results and GA data, then:

  • Correlates traffic data with file locations (which pages are tracked vs. untracked)
  • Identifies property ID inconsistencies (same site using multiple GA properties)
  • Generates a structured report with sections for gaps, recommendations, and traffic trends
  • Pushes each finding as a kanban card to the dashboard

This decouples the audit from the reporting layer: audits can run on a schedule without blocking the dashboard, and findings from successive runs accumulate on the board as a history.
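A simplified version of the correlation step might look like this. The function name summarize_findings, the input tuple shape, and the finding dictionaries are all hypothetical; the real orchestrator's data model is not shown in this write-up.

```python
from collections import defaultdict


def summarize_findings(scan_results):
    """Turn scan results into findings.

    scan_results: iterable of (site, file_path, ga_ids) tuples, where ga_ids
    is the (possibly empty) list of IDs found in that file.
    """
    ids_by_site = defaultdict(set)
    untracked = defaultdict(list)
    for site, file_path, ga_ids in scan_results:
        if ga_ids:
            ids_by_site[site].update(ga_ids)
        else:
            untracked[site].append(file_path)

    findings = []
    for site in sorted(untracked):
        findings.append(
            {"type": "missing_tracking", "site": site, "files": sorted(untracked[site])}
        )
    for site in sorted(ids_by_site):
        if len(ids_by_site[site]) > 1:  # same site, several GA IDs
            findings.append(
                {"type": "multiple_properties", "site": site, "ids": sorted(ids_by_site[site])}
            )
    return findings
```

Each returned dictionary corresponds to one candidate kanban card. Sorting the output keeps repeated runs comparable.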

Phase 4: Dashboard Deep-Linking

We leveraged the dashboard's existing hash-based navigation to create persistent, shareable deep links to specific audit cards:

Deep Link Format: https://progress.queenofsandiego.com/#card-{id}

Example: https://progress.queenofsandiego.com/#card-t-31aa2593

The dashboard's JavaScript router (in /assets/js/router.js) already listened for hash changes and rendered the correct card detail view. We only needed to ensure that the orchestrator emits card IDs matching the existing pattern (t- prefix + unique identifier).
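On the orchestrator side, generating and parsing these links can be sketched as below. The helper names are hypothetical, and the assumption that the unique identifier is eight lowercase hex characters is inferred only from the single example ID above; the dashboard may accept other shapes.

```python
import re
import uuid
from urllib.parse import urlsplit

DASHBOARD_URL = "https://progress.queenofsandiego.com/"

# Assumption: "t-" followed by 8 lowercase hex characters, as in t-31aa2593.
CARD_ID = re.compile(r"^t-[0-9a-f]{8}$")


def new_card_id():
    """Generate a card ID in the assumed t-xxxxxxxx format."""
    return "t-" + uuid.uuid4().hex[:8]


def card_link(card_id):
    """Build the deep link for a card, rejecting malformed IDs."""
    if not CARD_ID.match(card_id):
        raise ValueError(f"unexpected card id format: {card_id!r}")
    return f"{DASHBOARD_URL}#card-{card_id}"


def card_id_from_link(url):
    """Extract the card ID from a deep link, or return None if absent."""
    fragment = urlsplit(url).fragment
    if fragment.startswith("card-"):
        return fragment[len("card-"):]
    return None
```

Validating the ID before building the link catches format drift early, before a broken link reaches a chat message.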

Key Implementation Details

GA Property ID Inventory

We mapped all unique GA property IDs across the portfolio and documented their site assignments:


Property ID 123456789 → queenofsandiego.com (QOS)
Property ID 234567890 → jada-booking.com (JADA)
Property ID 345678901 → tools.sailjada.com (Tools)

This inventory was stored in memory files at:

  • /Users/cb/.claude/projects/-Users-cb-Documents-repos/memory/MEMORY.md
  • /Users/cb/.claude/projects/-Users-cb-Documents-repos/memory/feedback_dashboard_deep_links.md

Why separate memory files? The first file is our agent state (persistent across sessions), while the second captures dashboard-specific syntax we discovered during implementation. This way, future audits can reference the correct deep-link format without re-discovering it.

Service Account OAuth Flow

Instead of user-based OAuth (where refresh tokens expire after 7 days while the OAuth app's consent screen is still in Testing status, forcing manual re-approval), we used a Google service account with domain-wide delegation:

  1. Service account JSON credential stored securely outside version control
  2. Script reads credential file and builds a JWT
  3. JWT is exchanged for an access token via https://oauth2.googleapis.com/token
  4. Access token is cached with its expiration timestamp
  5. Subsequent GA API calls use the cached token until expiration
  6. On expiration, the script automatically requests a fresh token (no manual re-auth needed)

This unblocks scheduled audits: the audit can run daily without human intervention, which was the bottleneck before.
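Steps 2 and 3 of the flow can be illustrated with the unsigned part of the JWT. This is a sketch under assumptions: the function name build_signing_input is hypothetical, and the read-only Analytics scope is a guess at what the script requests. The RS256 signature over the returned string, made with the private key from the credential file, is omitted because it needs a cryptography library.

```python
import base64
import json
import time

TOKEN_URL = "https://oauth2.googleapis.com/token"
# Assumption: read-only Analytics access is sufficient for the audit.
SCOPE = "https://www.googleapis.com/auth/analytics.readonly"


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_signing_input(client_email, now=None, lifetime=3600, subject=None):
    """Return ("<header>.<claims>", claims) ready to be signed with RS256."""
    now = int(time.time() if now is None else now)
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": client_email,      # service account email from the JSON file
        "scope": SCOPE,
        "aud": TOKEN_URL,
        "iat": now,
        "exp": now + lifetime,    # Google allows at most one hour
    }
    if subject:
        claims["sub"] = subject   # user to impersonate under delegation
    parts = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    return ".".join(parts), claims
```

The signed JWT is then posted to the token URL with grant_type set to urn:ietf:params:oauth:grant-type:jwt-bearer, and the response carries the access_token and expires_in values that get cached in step 4.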

Why This Architecture?

  • Decoupled pipeline: Audit, API calls, and reporting are separate stages. Each can fail independently and be retried without corrupting the others.
  • Persistent results: Cards on the dashboard survive beyond a single console session. Engineers can review findings asynchronously.
  • Deep links: Specific findings are shareable. We can send Slack messages like "Check the Mother's Day blast status: https://progress.queenofsandiego.com/#card-t-31aa2593" without ambiguity.
  • Service account auth: No manual re-approval every 7 days. The system runs unattended on a cron job if needed.

Immediate