
The Content Pipeline: From Notion to Published in Four Stages

Connect, Source, Creative, Publish. Architecture and implementation.

The Prompt Engineering Project · February 18, 2025 · 10 min read

Quick Answer

Content pipeline automation connects ideation, drafting, editing, approval, and publishing into a streamlined workflow. Modern pipelines use AI for first-draft generation, database-driven editorial calendars, structured approval stages, and API-based multi-channel distribution. Automation eliminates bottlenecks between creation and publication while maintaining quality through human review checkpoints and brand-voice validation.

Content operations at scale require a pipeline, not a workflow. The distinction matters. A workflow is a sequence of manual steps that a person executes. A pipeline is an automated system where content flows through defined stages, with each stage performing a specific transformation. The Prompt Engineering Project publishes content through a four-stage pipeline that moves articles from Notion to the public web: Connect, Source, Creative, and Publish. This article documents the architecture, implementation, and lessons from building that pipeline.

Stage One: Connect

The Connect stage establishes the integration between the pipeline and Notion's API. Notion is the content management system -- all articles, metadata, and editorial status live in Notion databases. The pipeline needs to query those databases, handle pagination, respect rate limits, and recover from transient failures.

Notion's API enforces a rate limit of three requests per second. For a single content pull, this is rarely a constraint. For a full pipeline run that queries multiple databases, fetches block content for dozens of articles, and retrieves property metadata, you hit the limit within seconds. The pipeline implements a rate limiter that queues requests and dispatches them at the maximum safe cadence.

notion-client.ts
import { Client } from '@notionhq/client'

// Queues requests and dispatches them at a fixed cadence so the
// integration stays under Notion's three-requests-per-second limit.
class RateLimiter {
  private queue: Array<() => void> = []
  private processing = false
  private readonly interval: number

  constructor(requestsPerSecond: number) {
    this.interval = Math.ceil(1000 / requestsPerSecond)
  }

  async schedule<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push(async () => {
        try {
          resolve(await fn())
        } catch (err) {
          reject(err)
        }
      })
      this.processQueue()
    })
  }

  private async processQueue() {
    if (this.processing) return
    this.processing = true
    while (this.queue.length > 0) {
      const next = this.queue.shift()!
      await next()
      await new Promise((r) => setTimeout(r, this.interval))
    }
    this.processing = false
  }
}

const limiter = new RateLimiter(3)
const notion = new Client({ auth: process.env.NOTION_API_KEY })

// Queries a database and follows pagination cursors until the full
// result set is assembled, one rate-limited request at a time.
export async function queryDatabase(databaseId: string, filter?: any) {
  const results: any[] = []
  let cursor: string | undefined

  do {
    const response = await limiter.schedule(() =>
      notion.databases.query({
        database_id: databaseId,
        filter,
        start_cursor: cursor,
        page_size: 100,
      })
    )
    results.push(...response.results)
    cursor = response.has_more ? response.next_cursor! : undefined
  } while (cursor)

  return results
}

The pagination loop is essential. Notion's API returns a maximum of 100 results per request. A content database with more than 100 entries requires multiple paginated requests, each respecting the rate limit. The queryDatabase function handles this transparently -- callers receive a complete result set regardless of how many pages were required to assemble it.

Notion's rate limit is per-integration, not per-endpoint. A pipeline that queries multiple databases in parallel will hit the limit faster than one that queries them sequentially. Serialize your database queries and parallelize only within a single result set.
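
The serialization advice above can be sketched as a simple sequential loop. The helper name fetchAllDatabases and its callback shape are illustrative, not part of the pipeline's actual code:

```typescript
// Sketch: pull several databases one after another so the shared
// per-integration rate limit is consumed predictably. The query
// callback would wrap the paginated queryDatabase helper.
async function fetchAllDatabases(
  query: (id: string) => Promise<unknown[]>,
  databaseIds: string[]
): Promise<Map<string, unknown[]>> {
  const byDatabase = new Map<string, unknown[]>()
  // for..of with await serializes the pulls; Promise.all would race
  // them all against the same three-requests-per-second budget.
  for (const id of databaseIds) {
    byDatabase.set(id, await query(id))
  }
  return byDatabase
}
```

Parallelism still happens inside each result set -- fetching block content for the pages of one database, for example -- where the rate limiter paces the individual requests.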

Stage Two: Source

The Source stage transforms raw Notion API responses into normalized content records. Notion's data model is block-based -- each page is a tree of blocks, where each block has a type (paragraph, heading, code, image, callout) and type-specific content. This model is powerful for editing but awkward for downstream consumption. The Source stage flattens the block tree into a structured content record with extracted properties and rendered content.

source-stage.ts
interface ContentRecord {
  id: string
  slug: string
  title: string
  subtitle: string
  excerpt: string
  status: 'draft' | 'review' | 'approved' | 'published'
  pillar: string
  type: string
  tags: string[]
  author: string
  date: string
  wordCount: number
  readTime: number
  content: ContentBlock[]
  metadata: Record<string, unknown>
}

interface ContentBlock {
  type: 'paragraph' | 'heading' | 'code' | 'image' | 'callout' | 'quote'
  level?: number        // for headings
  language?: string     // for code blocks
  text: string
  annotations?: {
    bold: boolean
    italic: boolean
    code: boolean
  }
}

// Maps Notion property names (Slug, Title, Status, ...) from the
// database schema onto the normalized ContentRecord shape.
function normalizeRecord(notionPage: any): ContentRecord {
  const props = notionPage.properties
  return {
    id: notionPage.id,
    slug: extractPlainText(props.Slug),
    title: extractPlainText(props.Title),
    subtitle: extractPlainText(props.Subtitle),
    excerpt: extractPlainText(props.Excerpt),
    status: props.Status.select?.name?.toLowerCase() ?? 'draft',
    pillar: props.Pillar.select?.name ?? '',
    type: props.Type.select?.name ?? '',
    tags: props.Tags.multi_select?.map((t: any) => t.name) ?? [],
    author: extractPlainText(props.Author),
    date: props.Date.date?.start ?? '',
    wordCount: props.WordCount?.number ?? 0,
    readTime: props.ReadTime?.number ?? 0,
    content: [],  // populated by block extraction
    metadata: {},
  }
}
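
The extractPlainText helper used in normalizeRecord is not shown above; a minimal sketch, assuming the standard shape of Notion title and rich_text properties:

```typescript
// Sketch of extractPlainText: Notion title and rich_text properties
// both carry an array of rich text objects with a plain_text field.
function extractPlainText(property: any): string {
  const richText = property?.title ?? property?.rich_text ?? []
  return richText.map((t: any) => t.plain_text ?? '').join('')
}
```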

The status field deserves attention. Content moves through four lifecycle states: draft, review, approved, and published. Only content with a status of "approved" or "published" enters the Creative stage. Drafts and content under review are visible in the pipeline dashboard but are excluded from downstream processing. This gate prevents incomplete content from reaching the public web.

The status lifecycle is a quality gate, not a workflow preference. Content that has not been explicitly approved does not enter the pipeline. There are no exceptions and no overrides.

Stage Three: Creative

The Creative stage applies AI-assisted content enhancement to approved records. This is not AI-generated content. The articles are written by humans. The Creative stage handles formatting, consistency, SEO optimization, and metadata generation -- tasks that are mechanical, repetitive, and well-suited to automation.

1. Formatting normalization: consistent heading hierarchy, paragraph spacing, and list structure across all articles, regardless of how the author formatted the original in Notion.

2. SEO meta generation: title tags, meta descriptions, Open Graph properties, and structured data derived from the article content and metadata.

3. Reading time calculation: word count divided by 238 words per minute, rounded up. This uses the accepted average for technical content, which is slower than general reading.

4. Excerpt extraction: if the author did not provide an excerpt, the Creative stage generates one from the first two paragraphs, constrained to 160 characters for search result display.

5. Cross-reference linking: identifies mentions of other articles in the pipeline and generates internal links automatically.
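
The read-time rule above reduces to a few lines:

```typescript
// Word count divided by 238 words per minute, rounded up.
const WORDS_PER_MINUTE = 238

function readTime(wordCount: number): number {
  return Math.ceil(wordCount / WORDS_PER_MINUTE)
}
```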

The Creative stage uses prompt-driven processing. Each enhancement task has a dedicated prompt template with explicit constraints. The SEO meta generation prompt, for example, specifies maximum character lengths, forbidden words, and the requirement that the description must be a complete sentence, not a fragment. These constraints produce consistent output across hundreds of articles without manual review of each result.

creative-stage.ts
async function generateMetaTags(record: ContentRecord) {
  // The first few blocks approximate the opening of the article.
  const opening = record.content.slice(0, 3).map((b) => b.text).join(' ')
  const prompt = `Generate SEO metadata for the following article.

Title: ${record.title}
Excerpt: ${record.excerpt}
Content (opening): ${opening}

Requirements:
- meta_title: max 60 characters, include primary keyword
- meta_description: max 155 characters, complete sentence
- og_title: can differ from meta_title, max 70 characters
- keywords: array of 5-8 relevant terms, no generic words

Return JSON only. No explanation.`

  const result = await llm.generate({
    model: 'claude-sonnet',
    prompt,
    temperature: 0.2,
    max_tokens: 300,
  })

  // Models occasionally wrap JSON in code fences despite the
  // instruction; strip them before parsing.
  const text = result.text.replace(/^```(?:json)?\s*|\s*```$/g, '')
  return JSON.parse(text)
}

Stage Four: Publish

The Publish stage takes the enhanced content records and generates the static pages that serve the public site. The Prompt Engineering Project blog uses Next.js static generation, which means each article is rendered to HTML at build time and served from a CDN. There is no server-side rendering on each request. The result is pages that load in under 200 milliseconds from any location.

The publish process handles three specific concerns: static page generation, incremental rebuilds, and URL management.

Static Generation

Each content record maps to a page.tsx file in the blog directory. The blog data registry -- a TypeScript file containing metadata for every article -- is the single source of truth for the build system. When a new article is approved and enters the Publish stage, its metadata is added to the registry and its page.tsx file is generated. The next build picks up both changes automatically.
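
A sketch of the per-article generation step. The layout component and its import path are hypothetical stand-ins for the project's actual template:

```typescript
// Sketch: render a page.tsx source file from a registry entry.
// ArticleLayout and '@/components/article-layout' are illustrative
// assumptions, not the project's real components.
interface RegistryEntry {
  slug: string
  title: string
  date: string
}

function renderPageFile(entry: RegistryEntry): string {
  return [
    `import { ArticleLayout } from '@/components/article-layout'`,
    ``,
    `export const metadata = { title: ${JSON.stringify(entry.title)} }`,
    ``,
    `export default function Page() {`,
    `  return <ArticleLayout slug=${JSON.stringify(entry.slug)} />`,
    `}`,
  ].join('\n')
}
```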

Incremental Rebuilds

A full site rebuild is triggered only when structural changes occur -- new routes, changed layouts, updated shared components. Content updates within existing pages use incremental static regeneration. The revalidation window is set to one hour, which means content updates are live within sixty minutes of the pipeline run without requiring a full deployment.
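
In Next.js App Router terms, the one-hour window amounts to a single route segment export; a config fragment, not the project's actual file:

```typescript
// Route segment config: regenerate a cached article page at most
// once per hour (3600 seconds) after a request comes in.
export const revalidate = 3600
```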

URL Management

URLs are permanent. Once an article is published at a URL, that URL never changes. If an article is retitled, the slug remains the same. If an article is reorganized into a different pillar, the URL remains the same. Redirects are added only when content is merged or permanently removed, and they are 301 redirects that preserve search engine authority.
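
A merge redirect, in Next.js configuration terms; the paths here are hypothetical:

```typescript
// next.config.ts sketch: a permanent redirect added when two
// articles are merged. Source and destination paths are made up.
const nextConfig = {
  async redirects() {
    return [
      {
        source: '/blog/old-merged-article',
        destination: '/blog/surviving-article',
        permanent: true, // permanent redirect; search authority transfers
      },
    ]
  },
}

export default nextConfig
```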

URL permanence is a contract with search engines, link aggregators, and every person who has bookmarked or shared the link. Breaking URLs is breaking trust. Avoid it at all costs.

The Feed Builder UI

The pipeline is operated through a feed builder interface that provides visibility into every stage. The dashboard shows the current state of all content records: how many are in draft, how many are under review, how many are approved and waiting for the next pipeline run, and how many are published. Each record can be inspected to see its full transformation history -- what came from Notion, what the Source stage extracted, what the Creative stage enhanced, and what was published.

Error recovery is built into the interface. When a pipeline stage fails for a specific record -- a Notion API timeout, a malformed block structure, a failed AI enhancement -- the record is flagged with the failure reason and excluded from downstream stages. The operator can inspect the error, fix the source content in Notion, and re-run the pipeline for that specific record without reprocessing the entire database.

pipeline-runner.ts
interface PipelineResult {
  stage: 'connect' | 'source' | 'creative' | 'publish'
  recordId: string
  status: 'success' | 'failed' | 'skipped'
  error?: string
  duration: number
  timestamp: string
}

async function runPipeline(options: { recordIds?: string[] }) {
  const results: PipelineResult[] = []

  // Stage 1: Connect
  const rawPages = await connectStage(options.recordIds)

  for (const page of rawPages) {
    // Track the current stage so failures are attributed correctly.
    let stage: PipelineResult['stage'] = 'source'
    const start = Date.now()
    try {
      // Stage 2: Source
      const record = await sourceStage(page)
      if (record.status !== 'approved' && record.status !== 'published') {
        results.push({
          stage, recordId: record.id,
          status: 'skipped', duration: Date.now() - start,
          timestamp: new Date().toISOString(),
        })
        continue
      }

      // Stage 3: Creative
      stage = 'creative'
      const enhanced = await creativeStage(record)

      // Stage 4: Publish
      stage = 'publish'
      await publishStage(enhanced)
      results.push({
        stage, recordId: record.id,
        status: 'success', duration: Date.now() - start,
        timestamp: new Date().toISOString(),
      })
    } catch (err) {
      // Per-record failure: flag it and continue with the next record.
      results.push({
        stage, recordId: page.id,
        status: 'failed', error: String(err),
        duration: Date.now() - start, timestamp: new Date().toISOString(),
      })
    }
  }
  return results
}

The pipeline processes approximately forty articles in a full run. Average run time is under three minutes, with the Connect stage consuming the majority of that time due to rate limiting. The Creative stage runs in parallel across approved records, and the Publish stage is a single build command that generates all static pages simultaneously.

A pipeline is not a script you run manually. It is a system that runs on a schedule, handles its own errors, and produces consistent output regardless of who triggers it or when.

What We Would Change

Two things. First, we would add a preview stage between Creative and Publish. Currently, enhanced content goes directly to static generation. A preview stage that renders the article in an isolated environment and presents it for human review before publishing would catch formatting issues that the automated pipeline misses -- particularly around code block rendering, image placement, and heading hierarchy.

Second, we would implement semantic diffing between pipeline runs. Currently, every approved record is reprocessed on every run, regardless of whether its content changed. A diff mechanism that detects which records actually changed since the last run would reduce processing time and AI token costs significantly. For forty articles, the cost is manageable. For four hundred, it would not be.
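
The diff mechanism could be as simple as a content hash per record, compared against the stored hashes from the previous run. The helper names here are illustrative:

```typescript
import { createHash } from 'node:crypto'

// Sketch: hash each normalized record and reprocess only those whose
// hash differs from the previous run's stored value.
function recordHash(record: unknown): string {
  return createHash('sha256').update(JSON.stringify(record)).digest('hex')
}

function changedRecords<T extends { id: string }>(
  records: T[],
  previousHashes: Map<string, string>
): T[] {
  return records.filter((r) => previousHashes.get(r.id) !== recordHash(r))
}
```

Unchanged records would skip the Creative stage entirely, which is where the AI token cost accrues.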


Key Takeaways

1. A content pipeline has four stages: Connect (API integration), Source (data normalization), Creative (AI-assisted enhancement), and Publish (static generation and deployment).

2. Rate limiting is mandatory for Notion API integration. Three requests per second, with queued dispatch and pagination handling.

3. The status lifecycle (draft, review, approved, published) is a quality gate that prevents incomplete content from reaching the public web.

4. AI-assisted enhancement handles formatting, SEO, and cross-referencing -- mechanical tasks that benefit from consistency, not creativity.

5. URLs are permanent contracts. Slugs do not change when titles change. Redirects are added only for merges or removals.

6. Error recovery must be per-record, not per-run. A single failure should not block the rest of the pipeline.

