NEXUS deploys eight specialized agents equipped with real tools — RFM clustering, spam scoring, chi-square tests — that cycle on failure, run in parallel, and accumulate cross-campaign memory.
Built by Abhay Agarwal · MNNIT Allahabad · FrostHack
$ langgraph run --brief "Winter sale · urban shoppers · ₹1500+ AOV"
LangGraph routes based on what agents compute — not a fixed sequence. The Quality Gate cycles back to Strategy on failure. Human rejection routes back to Content Gen with feedback injected into state.
Tools are deterministic Python functions — not LLM calls. The LLM decides which tool to invoke; Python executes it and returns real data.
Extracts product, audience, goals and CTA from natural language. GPT-4 with schema validation.
Real RFM scoring + sklearn KMeans on actual customer data. Validates segment sizes for A/B significance.
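As a rough illustration of the RFM step, the sketch below scores customers from 1 to `bins` on recency, frequency, and monetary value using rank-based binning. It uses only the standard library; the field names (`last_order`, `orders`, `spend`) and the function `rfm_scores` are hypothetical, and the KMeans clustering that NEXUS runs on top of these scores is omitted here.

```python
from datetime import date

def rfm_scores(customers, today, bins=3):
    """Return {customer_id: (R, F, M)} with each score in 1..bins."""
    def rank_score(values, higher_is_better=True):
        # Sort indices so the worst value comes first, then split ranks into bins.
        order = sorted(range(len(values)), key=lambda i: values[i],
                       reverse=not higher_is_better)
        scores = [0] * len(values)
        for rank, idx in enumerate(order):
            scores[idx] = 1 + rank * bins // len(values)
        return scores

    days_since = [(today - c["last_order"]).days for c in customers]
    r = rank_score(days_since, higher_is_better=False)  # fewer days is better
    f = rank_score([c["orders"] for c in customers])
    m = rank_score([c["spend"] for c in customers])
    return {c["id"]: (r[i], f[i], m[i]) for i, c in enumerate(customers)}

# Toy data, invented for this example only.
customers = [
    {"id": "A", "last_order": date(2024, 1, 30), "orders": 10, "spend": 9000.0},
    {"id": "B", "last_order": date(2024, 1, 1), "orders": 3, "spend": 2000.0},
    {"id": "C", "last_order": date(2023, 10, 1), "orders": 1, "spend": 500.0},
]
scores = rfm_scores(customers, today=date(2024, 1, 31))
```

The resulting score triples are the kind of feature vectors a clustering step such as KMeans would then group into segments.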
Plans send timing, budget allocation, and A/B test design. Re-plans when the Quality Gate rejects its output.
Pure deterministic checks — no LLM. Validates sizes, budget math, spacing. Routes to Strategy on failure.
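A gate of this kind can be sketched as a plain function that returns a list of failure reasons. The `Plan` fields, the 1,000-recipient minimum, and the 24-hour spacing rule below are assumptions made for illustration, not values taken from NEXUS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Plan:  # hypothetical shape of the Strategy agent's output
    segment_sizes: dict[str, int]
    budget_total: float
    budget_by_segment: dict[str, float]
    send_times: list[datetime]

def quality_gate(plan, min_segment_size=1000, min_gap=timedelta(hours=24)):
    """Return a list of failure messages; an empty list means the plan passes."""
    failures = []
    # Size check: every segment must be large enough to test on.
    for name, size in plan.segment_sizes.items():
        if size < min_segment_size:
            failures.append(f"segment {name!r} too small: {size} < {min_segment_size}")
    # Budget math: per-segment allocations must not exceed the total.
    allocated = sum(plan.budget_by_segment.values())
    if allocated > plan.budget_total + 1e-9:
        failures.append(f"budget over-allocated: {allocated} > {plan.budget_total}")
    # Spacing: consecutive sends must be at least min_gap apart.
    times = sorted(plan.send_times)
    for earlier, later in zip(times, times[1:]):
        if later - earlier < min_gap:
            failures.append(f"sends too close: {earlier} and {later}")
    return failures
```

Because the function is pure Python with no model call, the same plan always yields the same verdict, and the failure messages can be passed back to Strategy as feedback.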
Fan-out via LangGraph Send API. Parallel variants per segment, grounded in spam scores and CTR predictions.
LangGraph interrupt() pauses the graph, serializes state to MongoDB, resumes from checkpoint on approval.
Interfaces with Campaign & Email APIs. Logs recipient counts, timestamps, and delivery status into shared state.
Event-driven via Redis Streams. Chi-square significance tests before routing underperformers to Content Gen.
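For two variants compared on clicks versus non-clicks, the chi-square test reduces to a 2x2 table with one degree of freedom. The sketch below computes the statistic and p-value with the standard library only; the function name and the 0.05 threshold are assumptions, and no continuity correction is applied.

```python
import math

def chi_square_2x2(clicks_a, sends_a, clicks_b, sends_b):
    """Pearson chi-square test on a 2x2 click table. Returns (chi2, p_value)."""
    a, b = clicks_a, sends_a - clicks_a   # variant A: clicked, did not click
    c, d = clicks_b, sends_b - clicks_b   # variant B: clicked, did not click
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, chi2 is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_square_2x2(clicks_a=120, sends_a=1000, clicks_b=80, sends_b=1000)
is_significant = p < 0.05  # assumed significance level
```

Only when the difference is significant would an underperforming variant be routed back to Content Gen; otherwise the gap may be noise.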
Three architectural patterns that separate NEXUS from a sequential LLM pipeline — implemented with real code, not described in a prompt.
Instead of generating content for each segment sequentially, NEXUS uses LangGraph's Send API to fan out one Content Gen task per segment concurrently: four segments in 15 seconds instead of 60. The fan-out is safe because segments are independent of one another.
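The effect of the fan-out can be illustrated without LangGraph. The sketch below uses `asyncio.gather` as a stand-in for the Send API (it is not LangGraph code), and `generate_variant` is a hypothetical placeholder for the per-segment Content Gen call.

```python
import asyncio

async def generate_variant(segment, delay=0.05):
    # Placeholder for an LLM-backed content call; the sleep simulates latency.
    await asyncio.sleep(delay)
    return {"segment": segment, "copy": f"draft for {segment}"}

async def fan_out(segments):
    # One task per segment, all awaited together; results keep input order.
    return await asyncio.gather(*(generate_variant(s) for s in segments))

def run_fan_out(segments):
    return asyncio.run(fan_out(segments))

variants = run_fan_out(["champions", "loyal", "at_risk", "new"])
```

Since the tasks wait concurrently, total wall time is roughly that of the slowest single task rather than the sum of all four.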
Run-level state is checkpointed to MongoDB after every node, so the graph survives server restarts. A cross-campaign pgvector store captures what worked and why, and is queried by semantic similarity when new campaigns start.
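Semantic-similarity lookup amounts to ranking stored embeddings by closeness to a query embedding. The sketch below does this in memory with cosine similarity as a stand-in for a pgvector query; the two-dimensional vectors and memory notes are toy values invented for the example.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def top_k(query, memories, k=2):
    """Return the k memory notes whose embeddings are most similar to the query."""
    ranked = sorted(memories,
                    key=lambda m: cosine_similarity(query, m["embedding"]),
                    reverse=True)
    return [m["note"] for m in ranked[:k]]

# Hypothetical campaign memories with toy embeddings.
memories = [
    {"note": "winter sale, urgency subject lines", "embedding": [1.0, 0.0]},
    {"note": "loyalty programme launch", "embedding": [0.0, 1.0]},
    {"note": "festive discount, urban audience", "embedding": [0.7, 0.7]},
]
closest = top_k([1.0, 0.0], memories, k=2)
```

In a real store the embeddings would come from an embedding model and the ranking would run inside the database.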
Every edge has a condition. Quality Gate checks segment sizes, budget math, send-time spacing — all deterministic, no LLM. Failure routes back to Strategy (max 3 retries). Human rejection routes back to Content Gen with the feedback injected into state.
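The routing rules above can be written as small functions that read the state and return the name of the next node, which is the general shape a conditional edge takes. The state keys and node names below are hypothetical, and the `"escalate"` branch is an assumption: the text does not say what happens once the three retries are used up.

```python
MAX_STRATEGY_RETRIES = 3  # retry limit stated in the text

def route_after_quality_gate(state):
    """Pick the next node after the Quality Gate has run."""
    if not state["gate_failures"]:
        return "content_gen"
    if state["strategy_retries"] < MAX_STRATEGY_RETRIES:
        return "strategy"
    return "escalate"  # assumed fallback; not specified by the source

def route_after_human_review(state):
    """Pick the next node after the human approval step."""
    if state["approved"]:
        return "execute"
    # Rejection: the reviewer's feedback is already in state for Content Gen.
    return "content_gen"
```

Keeping the routers free of LLM calls means the control flow can be unit-tested like any other function.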
Every node, every LLM call, and every tool call is traced with latencies and token counts. Tracing detects prompt injection, measures per-tenant costs, and surfaces pipeline bottlenecks. It makes "how do you know it's working?" answerable with data.
Multi-tenancy, federated learning, fine-tuned models, adversarial debate, autonomous monitoring — problems that only exist at scale. Each one a genuine architectural addition, not a feature flag.
Complete data isolation per tenant. Abstracted patterns — normalized metrics, structural content patterns, anonymized audience tier signals — are aggregated nightly into a shared knowledge base. New tenants benefit from platform-wide learnings without accessing any existing customer's data.
Continuous Redis Streams subscriber — not a polling job. Reacts to unsubscribe spikes, out-of-stock events, competitor campaign detections, and viral moments in real time. Each signal type has a response playbook with ordered actions by urgency and reversibility.
Llama 3.1 8B fine-tuned with LoRA adapters on actual campaign open rates and CTRs: the training labels are real behavioral outcomes, not a human rater's opinion on what sounds good. Deployed after 20+ campaigns, versioned in MLflow.
High-stakes campaigns (large budget, large audience, new category) route through a four-phase deliberation: Strategist proposes → Devil's Advocate critiques with evidence → Strategist revises → Risk Assessment Agent scores across brand safety, audience risk, financial risk, and compliance.
Top-level agent that accepts plain-language instructions and orchestrates any combination of agents, database operations, and scheduled tasks. Uses confidence thresholds for ambiguity resolution — proceeds autonomously above 0.85, asks one clarifying question below 0.60.
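The two thresholds can be expressed as a small decision function. The function and return labels are hypothetical, and the behaviour between 0.60 and 0.85 is not described in the text, so the sketch returns a placeholder for that band rather than guessing.

```python
AUTO_THRESHOLD = 0.85     # above this: proceed autonomously
CLARIFY_THRESHOLD = 0.60  # below this: ask one clarifying question

def resolve_ambiguity(confidence):
    """Map an interpretation confidence in [0, 1] to an action label."""
    if confidence > AUTO_THRESHOLD:
        return "proceed"
    if confidence < CLARIFY_THRESHOLD:
        return "ask_clarifying_question"
    # The source does not define this middle band; placeholder only.
    return "unspecified"
```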
Continuously measures four dimensions: task completion rate, decision accuracy (predicted vs actual CTR), hallucination rate (unverified factual claims from memory), and cost efficiency per node. Generates weekly reports. Answers "how do you know it's working?" with data.
The full pipeline — RFM clustering, quality gates, parallel content generation, human approval, statistical optimization — runs end to end from a plain-language brief.