# HeadlineSift.com — Technical Product Specification

**Version:** 1.0.0 — MVP
**Date:** 2026-06-09
**Author:** HeadlineSift Team
**Stack:** Next.js App Router · TypeScript · Tailwind CSS · SQLite · Prisma · Zod
**Hosting:** GoDaddy VPS · WHM · cPanel · Node.js v22

---

## Table of Contents

1. [Product Overview](#1-product-overview)
2. [Target Users](#2-target-users)
3. [User Roles](#3-user-roles)
4. [Public User Flows](#4-public-user-flows)
5. [Admin User Flows](#5-admin-user-flows)
6. [Complete Feature List](#6-complete-feature-list)
7. [SQLite-Based Architecture](#7-sqlite-based-architecture)
8. [WHM/cPanel Deployment Architecture](#8-whmcpanel-deployment-architecture)
9. [Directory Structure](#9-directory-structure)
10. [Database Entities](#10-database-entities)
11. [Database-Backed Job System](#11-database-backed-job-system)
12. [Local Script Strategy](#12-local-script-strategy)
13. [cPanel Cron Strategy](#13-cpanel-cron-strategy)
14. [Deduplication Strategy](#14-deduplication-strategy)
15. [Story Clustering Strategy](#15-story-clustering-strategy)
16. [Ranking Strategy](#16-ranking-strategy)
17. [AI Analysis Strategy](#17-ai-analysis-strategy)
18. [Admin Moderation Workflow](#18-admin-moderation-workflow)
19. [Source Management Workflow](#19-source-management-workflow)
20. [Source Health Monitoring](#20-source-health-monitoring)
21. [CSV Import/Export Plan](#21-csv-importexport-plan)
22. [SQLite Backup and Restore Strategy](#22-sqlite-backup-and-restore-strategy)
23. [SQLite Limitations and Write Conflict Avoidance](#23-sqlite-limitations-and-write-conflict-avoidance)
24. [Future Migration Path to PostgreSQL](#24-future-migration-path-to-postgresql)
25. [Security Considerations](#25-security-considerations)
26. [Legal/Disclaimer Considerations](#26-legaldisclaimer-considerations)
27. [MVP Milestones](#27-mvp-milestones)
28. [Folder Structure Recommendation](#28-folder-structure-recommendation)
29. [Environment Variables](#29-environment-variables)
30. [Acceptance Criteria](#30-acceptance-criteria)

---

## 1. Product Overview

**HeadlineSift.com** is an AI-powered headline intelligence platform. It fetches articles from trusted public RSS feeds, APIs, and official sources, then runs them through a pipeline of deduplication, clustering, ranking, and AI analysis. The result is a clean, single-page dashboard showing only the most important stories, each with a summary, impact analysis, confidence rating, and source attribution.

### Core Value Proposition

> *We filter thousands of headlines into the most important stories, with AI-powered summaries, impact analysis, and source confidence.*

### Core Pipeline

```
Fetch articles/headlines from trusted sources
        ↓
Normalize raw articles
        ↓
Remove exact duplicates
        ↓
Group similar stories into clusters
        ↓
Rank clusters by importance
        ↓
Select top N per country/category
        ↓
Generate AI analysis (summary, why it matters, impact)
        ↓
Admin reviews and publishes
        ↓
Public users see clean headline dashboard
```

---

## 2. Target Users

### Primary: News-Conscious General Public
- People who want curated, high-signal news without doom-scrolling
- Multi-country readers (India, US, UK, Japan initially)
- People who want to understand **why** a story matters and **how** it affects them

### Secondary: Professionals Who Need News Intelligence
- Business professionals tracking market/economic news
- Students and educators tracking education/career updates
- Health-conscious individuals tracking medical/science news
- Tech professionals tracking industry developments

### Tertiary: Content Curators
- The admin team who manages sources, reviews AI output, and publishes stories
- Future: power users who may want custom feeds

---

## 3. User Roles

| Role | Description | Permissions |
|------|-------------|-------------|
| **Public Visitor** | Unauthenticated user | View published stories, filter by country/category, click through to original sources |
| **Admin** | Authenticated administrator | Full access to admin panel — manage countries, categories, sources, mappings, review/publish/hide stories, trigger fetches, view logs, run backups, import/export CSV, configure settings |
| **Super Admin** (future) | Multi-admin support | Same as Admin plus: manage other admin accounts, change system-critical settings |

For MVP, only a single Admin role is needed. Authentication uses a token-based session cookie that expires after 24 hours; the admin password is stored as a bcrypt hash in the `settings` table.

---

## 4. Public User Flows

### Flow 1: Browse Headlines (Default View)
```
Visitor lands on HeadlineSift.com
        ↓
Sees top-ranked stories for "Global / All Categories"
        ↓
Scrolls through story cards
        ↓
Each card shows: headline, summary, why it matters, positive/negative impact,
                 impact level, confidence level, source count, source names,
                 last updated timestamp, "Read original" link
        ↓
Clicks "Read original" → opens source article in new tab
```

### Flow 2: Filter by Country
```
Visitor selects a country from the country filter bar
        ↓
Page reloads with stories for that country (across all categories)
        ↓
Visitor can further refine by category
```

### Flow 3: Filter by Category
```
Visitor selects a category (e.g., "Technology")
        ↓
Page shows top stories in that category for the currently selected country
        ↓
Global categories show worldwide stories; country categories show country-specific stories
```

### Flow 4: Change Sort Order
```
Visitor changes sort from "Top Ranked" (default) to:
  - Latest → sorted by publication time
  - Most Covered → sorted by source count
  - High Impact → sorted by impact level
```
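
The four sort orders above could be implemented as a table of comparators. The sketch below is illustrative only: the `SortableStory` shape, the field names, the numeric impact weights, and the tie-break on rank score are assumptions, not part of the spec.

```typescript
// Hypothetical shape of a published story, reduced to what sorting needs.
type ImpactLevel = "High" | "Medium" | "Low";
type SortOrder = "top-ranked" | "latest" | "most-covered" | "high-impact";

interface SortableStory {
  title: string;
  rankScore: number; // higher ranks first
  publishedAt: Date;
  sourceCount: number;
  impactLevel: ImpactLevel;
}

// Assumed numeric ordering of impact levels; the spec names the levels only.
const IMPACT_WEIGHT: Record<ImpactLevel, number> = { High: 3, Medium: 2, Low: 1 };

function sortStories(stories: SortableStory[], order: SortOrder): SortableStory[] {
  const byRank = (a: SortableStory, b: SortableStory): number => b.rankScore - a.rankScore;
  const comparators: Record<SortOrder, (a: SortableStory, b: SortableStory) => number> = {
    "top-ranked": byRank,
    "latest": (a, b) => b.publishedAt.getTime() - a.publishedAt.getTime(),
    // Ties fall back to rank score (an assumption) so the order stays stable.
    "most-covered": (a, b) => b.sourceCount - a.sourceCount || byRank(a, b),
    "high-impact": (a, b) =>
      IMPACT_WEIGHT[b.impactLevel] - IMPACT_WEIGHT[a.impactLevel] || byRank(a, b),
  };
  // Copy before sorting so the caller's array is left untouched.
  return [...stories].sort(comparators[order]);
}
```

In practice the same ordering would more likely be pushed into the database query; the comparator table is mainly useful for documenting the intended semantics of each option.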

### Flow 5: Time Window Filter
```
Visitor selects time window: Today, Last 3 Days, This Week, This Month
        ↓
Only stories first seen within that window are displayed
```
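
A minimal sketch of the time window filter follows. It assumes rolling windows measured back from the current time (24 hours for "Today", 30 days for "This Month"); a calendar-day interpretation is equally plausible and the spec does not choose between them. The `firstSeenAt` field name is hypothetical.

```typescript
type TimeWindow = "today" | "3-days" | "week" | "month";

// Assumed window lengths in days (rolling, not calendar-aligned).
const WINDOW_DAYS: Record<TimeWindow, number> = {
  "today": 1,
  "3-days": 3,
  "week": 7,
  "month": 30,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Earliest first-seen time that still falls inside the window.
function windowCutoff(timeWindow: TimeWindow, now: Date): Date {
  return new Date(now.getTime() - WINDOW_DAYS[timeWindow] * MS_PER_DAY);
}

// Keep only stories first seen at or after the cutoff.
function filterByWindow<T extends { firstSeenAt: Date }>(
  stories: T[],
  timeWindow: TimeWindow,
  now: Date,
): T[] {
  const cutoff = windowCutoff(timeWindow, now).getTime();
  return stories.filter((s) => s.firstSeenAt.getTime() >= cutoff);
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the window boundaries easy to test.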

### Flow 6: Confidence/Impact Filter
```
Visitor sets confidence filter to "High only"
        ↓
Only stories with High confidence are shown
        ↓
Visitor sets impact filter to "High only"
        ↓
Only stories with High impact are shown
```

### Flow 7: No Results State
```
Filters return zero stories
        ↓
Message: "No stories match your filters. Try broadening your selection."
        ↓
Button: "Reset all filters"
```

### Flow 8: Error State
```
If the public page fails to load stories
        ↓
Message: "Unable to load stories right now. Please try again."
        ↓
Retry button available
```

---

## 5. Admin User Flows

### Flow 1: Login
```
Admin navigates to /admin/login
        ↓
Enters password
        ↓
On success → redirected to /admin/dashboard
        ↓
On failure → error message, rate-limited after 5 attempts
        ↓
Session lasts 24 hours, then requires re-login
```
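
The "rate-limited after 5 attempts" rule could be sketched as a small in-memory limiter keyed by client identifier (for example, IP address). The 15-minute window, the class name, and the method names are assumptions; the spec fixes only the threshold of 5.

```typescript
interface AttemptRecord {
  failures: number;
  windowStart: number; // epoch milliseconds of the first failure in the window
}

class LoginRateLimiter {
  private readonly attempts = new Map<string, AttemptRecord>();
  private readonly maxFailures: number;
  private readonly windowMs: number;

  // 5 failures comes from the spec; the 15-minute window is an assumption.
  constructor(maxFailures = 5, windowMs = 15 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
  }

  // True when the key has used up its failures inside the current window.
  isBlocked(key: string, now: number): boolean {
    const rec = this.attempts.get(key);
    if (!rec) return false;
    if (now - rec.windowStart >= this.windowMs) {
      this.attempts.delete(key); // window expired, start fresh
      return false;
    }
    return rec.failures >= this.maxFailures;
  }

  recordFailure(key: string, now: number): void {
    const rec = this.attempts.get(key);
    if (!rec || now - rec.windowStart >= this.windowMs) {
      this.attempts.set(key, { failures: 1, windowStart: now });
    } else {
      rec.failures += 1;
    }
  }

  // A successful login clears the counter for that key.
  recordSuccess(key: string): void {
    this.attempts.delete(key);
  }
}
```

An in-memory map is lost on restart and is not shared between processes, so it only suits a single Node.js process; persisting counters in SQLite would be the alternative if that becomes a concern.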

### Flow 2: Dashboard Overview
```
Admin lands on dashboard
        ↓
Sees metrics cards:
  - Total sources / Active / Failed
  - Articles fetched today
  - Duplicates found
  - Story clusters created
  - AI analyses generated
  - Stories pending review
  - Stories published
  - Top categories (by story count)
  - Top broken sources
```

### Flow 3: Manage Countries
```
Admin → Countries
        ↓
Table: Name, Code, Region, Language, Status, Display Order
        ↓
Actions: Add, Edit, Toggle Status, Delete (soft)
        ↓
Add/Edit form:
  - Country name (required)
  - Country code (required, e.g., IN, US, GB, JP)
  - Region (required, e.g., Asia, North America, Europe)
  - Default language (required, e.g., English, Japanese)
  - Status: Active / Inactive
  - Display order (integer)
  - Is global option (checkbox — if checked, "Global" appears as a filter option using this country's settings as fallback)
```

### Flow 4: Manage Categories
```
Admin → Categories
        ↓
Table: Name, Slug, Level, Status, Display Order, Max Stories
        ↓
Actions: Add, Edit, Toggle Status, Delete (soft)
        ↓
Add/Edit form:
  - Category name (required)
  - Slug (auto-generated from name, editable)
  - Level: Global / Country / Both
  - Description (optional)
  - Status: Active / Inactive
  - Display order (integer)
  - AI safety level: Low / Medium / High / Critical
  - Max public stories (default 50)
```
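
The "auto-generated from name" slug could be derived roughly as shown below. This is a sketch under assumptions: the exact slug rules (ASCII-only, hyphen separators) and the numeric suffix used to resolve collisions are not specified in this document.

```typescript
// Turn a category name into a lowercase, hyphen-separated ASCII slug.
function slugify(categoryName: string): string {
  return categoryName
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Assumed collision rule: append -2, -3, ... until the slug is unused.
function uniqueSlug(base: string, taken: ReadonlySet<string>): string {
  if (!taken.has(base)) return base;
  let n = 2;
  while (taken.has(`${base}-${n}`)) n += 1;
  return `${base}-${n}`;
}
```

Because the slug stays editable in the form, the generated value would only be a default that the admin can override.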

### Flow 5: Manage Sources
```
Admin → Sources
        ↓
Table: Name, Type, Country, Trust Score, Status, Last Fetched, Health
        ↓
Actions: Add, Edit, Toggle Status, Delete (soft), Test Feed, Fetch Now
        ↓
Add/Edit form:
  - Source name (required)
  - Website URL (required)
  - Source type: RSS / API / Manual
  - Feed endpoint URL (for RSS)
  - API endpoint URL (for API)
  - API key reference (stored in .env, never exposed in UI)
  - Language (required)
  - Trust score (1–10, default based on source type)
  - Status: Active / Inactive / Paused / Blocked
  - Fetch frequency in minutes (default varies by category)
  - Usage notes (free text)
  - Country (primary country of coverage)
```

### Flow 6: Manage Source Mappings
```
Admin → Source Mappings
        ↓
Each mapping links: one Source → one Country → one Category
        ↓
A single source can have multiple mappings
        ↓
Table: Source Name, Country, Category, Priority, Status
        ↓
Actions: Add Mapping, Edit Priority, Remove Mapping, Bulk Create
        ↓
Quick-add: select source, select countries (multi), select categories (multi),
           system creates all combinations
```
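
The quick-add step ("system creates all combinations") amounts to a Cartesian product of the selected countries and categories for one source. A minimal sketch, assuming numeric IDs and assuming that combinations which already exist are skipped rather than duplicated; `MappingDraft` and `buildMappingDrafts` are hypothetical names.

```typescript
interface MappingDraft {
  sourceId: number;
  countryId: number;
  categoryId: number;
}

// Stable key used to detect mappings that already exist.
const mappingKey = (m: MappingDraft): string =>
  `${m.sourceId}:${m.countryId}:${m.categoryId}`;

function buildMappingDrafts(
  sourceId: number,
  countryIds: number[],
  categoryIds: number[],
  existingKeys: ReadonlySet<string> = new Set<string>(),
): MappingDraft[] {
  const drafts: MappingDraft[] = [];
  const seen = new Set<string>(existingKeys);
  for (const countryId of countryIds) {
    for (const categoryId of categoryIds) {
      const draft: MappingDraft = { sourceId, countryId, categoryId };
      const key = mappingKey(draft);
      if (seen.has(key)) continue; // already mapped, or repeated in the selection
      seen.add(key);
      drafts.push(draft);
    }
  }
  return drafts;
}
```

Selecting 2 countries and 3 categories therefore yields up to 6 new mappings, fewer if some already exist.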

### Flow 7: View Raw Articles
```
Admin → Fetched Articles
        ↓
Table: Title, Source, Country, Category, Published At, Fetched At, Status
        ↓
Statuses: New, Duplicate, Clustered, Ignored
        ↓
Filters: by source, country, category, status, date range
        ↓
Click article → detail view: full metadata, raw content snippet,
                cluster assignment, similarity matches
```

### Flow 8: View Story Clusters
```
Admin → Story Clusters
        ↓
Table: Canonical Title, Category, Country, Source Count, Rank Score,
       Impact Score, Confidence, Status
        ↓
Statuses: Pending Review, Approved, Published, Rejected, Hidden
        ↓
Filters: by status, category, country, date range
        ↓
Click cluster → detail view:
  - All member articles with similarity scores
  - AI analysis (if generated)
  - Rank breakdown (showing each scoring factor)
  - Action buttons: Approve, Reject, Hide, Edit AI Analysis, Publish
```

### Flow 9: Review and Moderate Stories
```
Admin → Review Queue
        ↓
Shows clusters with status "Pending Review" sorted by rank score
        ↓
For each cluster, admin can:
  - Read the AI-generated analysis
  - Edit any AI-generated field (summary, why it matters, etc.)
  - See the "Human Edited" flag (set automatically when the admin edits a field)
  - Approve → status becomes "Approved"
  - Reject → status becomes "Rejected" (hidden from public, kept in DB)
  - Hide → status becomes "Hidden" (soft hide, reversible)
  - Publish → status becomes "Published" (visible on public page)
        ↓
Batch actions: Select multiple → Approve All / Reject All
        ↓
Auto-publish toggle per category (admin can enable for low-risk categories
  like Technology, Science where no manual review is needed)
```
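
The review actions above map directly onto cluster statuses, which could be captured in a lookup table. The sketch below is an assumption-laden illustration: the `ReviewableCluster` shape and function names are hypothetical, and it does not restrict which status a cluster must be in before an action is allowed, since the spec does not define that.

```typescript
type ClusterStatus = "Pending Review" | "Approved" | "Published" | "Rejected" | "Hidden";
type ReviewAction = "approve" | "reject" | "hide" | "publish";

// Each action and the status it produces, as listed in the flow.
const RESULTING_STATUS: Record<ReviewAction, ClusterStatus> = {
  approve: "Approved",
  reject: "Rejected",
  hide: "Hidden",
  publish: "Published",
};

interface ReviewableCluster {
  id: number;
  status: ClusterStatus;
  summary: string;
  humanEdited: boolean;
}

function applyReviewAction(cluster: ReviewableCluster, action: ReviewAction): ReviewableCluster {
  return { ...cluster, status: RESULTING_STATUS[action] };
}

// Editing an AI-generated field sets the "Human Edited" flag automatically.
function editSummary(cluster: ReviewableCluster, summary: string): ReviewableCluster {
  return { ...cluster, summary, humanEdited: true };
}

// Batch actions are limited to approve and reject, matching the flow.
function applyBatch(
  clusters: ReviewableCluster[],
  action: "approve" | "reject",
): ReviewableCluster[] {
  return clusters.map((c) => applyReviewAction(c, action));
}
```

Returning new objects instead of mutating keeps the previous state available, which may help if an audit trail of moderation decisions is added later.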

### Flow 10: View Fetch Logs
```
Admin → Fetch Logs
        ↓
Table: Source, Started, Ended, Status, Articles Found, Saved, Duplicates, Error
        ↓
Statuses: Success, Partial, Failed
        ↓
Click log → detail: full error message, stack trace if applicable,
                articles fetched in that run
```

### Flow 11: View Jobs
```
Admin → Jobs
        ↓
Table: Job Type, Status, Started, Completed, Progress, Result Summary
        ↓
Job types visible: Fetch All, Cluster Stories, Generate AI, Cleanup, Backup
        ↓
Statuses: Pending, Running, Completed, Failed
```

### Flow 12: Ranking Rules
```
Admin → Ranking Rules
        ↓
View current scoring weights in a read-only table
        ↓
Edit button → form with all scoring factors and their weights
        ↓
Each factor: name, points, isEnabled toggle
        ↓
Save → stored in settings table as JSON
        ↓
Reset to defaults button
```
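
Since each factor carries a name, a point value, and an enabled flag, and the set is stored as JSON, loading and applying the rules might look like the sketch below. The factor names in the usage comment are hypothetical, and the hand-written validation stands in for whatever Zod schema the project actually defines.

```typescript
interface RankingFactor {
  name: string;
  points: number;
  isEnabled: boolean;
}

interface RankBreakdown {
  total: number;
  breakdown: Record<string, number>; // points contributed per matched factor
}

// Parse and validate the JSON stored in the settings table.
function parseRankingFactors(json: string): RankingFactor[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("ranking rules must be a JSON array");
  return parsed.map((item: unknown, index: number): RankingFactor => {
    const { name: factorName, points, isEnabled } =
      (item ?? {}) as Partial<Record<keyof RankingFactor, unknown>>;
    if (
      typeof factorName !== "string" ||
      typeof points !== "number" ||
      typeof isEnabled !== "boolean"
    ) {
      throw new Error(`invalid ranking factor at index ${index}`);
    }
    return { name: factorName, points, isEnabled };
  });
}

// Sum the points of every enabled factor that the cluster matched.
function scoreCluster(
  factors: RankingFactor[],
  matchedFactorNames: ReadonlySet<string>,
): RankBreakdown {
  const breakdown: Record<string, number> = {};
  let total = 0;
  for (const factor of factors) {
    if (!factor.isEnabled || !matchedFactorNames.has(factor.name)) continue;
    breakdown[factor.name] = factor.points;
    total += factor.points;
  }
  return { total, breakdown };
}

// Usage (hypothetical factor names):
// scoreCluster(parseRankingFactors(storedJson), new Set(["multi_source", "trusted_source"]))
```

Keeping the per-factor breakdown alongside the total is what would feed the rank breakdown shown in the cluster detail view.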

### Flow 13: Settings
```
Admin → Settings
        ↓
Form with grouped settings:
  - General: Site name, default country, default category, stories per page
  - Admin: Admin password (change), session timeout
  - AI: AI provider, API key, model name, max tokens, temperature,
         auto-publish threshold confidence
  - Fetching: Global fetch interval, max articles per source per fetch,
              request timeout, user-agent string
  - Display: Public page title, meta description, footer text
  - Moderation: Auto-publish categories (multi-select),
                require review categories (multi-select)
```

### Flow 14: CSV Import/Export
```
Admin → Import/Export
        ↓
Export:
  - Select entity: Sources, Categories, Source Mappings
  - Click Export → downloads CSV file
        ↓
Import:
  - Select entity: Sources
  - Upload CSV file
  - Preview: shows parsed rows with validation errors highlighted
  - Confirm Import → rows are created, errors reported
  - Download error report option
```
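
The import preview step (parsed rows with validation errors highlighted) could be modelled as a function that returns every row together with its errors, so that valid rows can be imported and the rest written to the error report. In this sketch the column names (`name`, `website_url`, `source_type`, `trust_score`) are hypothetical, CSV parsing is assumed to have happened already, and plain checks stand in for the project's Zod schemas.

```typescript
type CsvRow = Record<string, string>;

interface RowPreview {
  rowNumber: number; // 1-based line in the file; the header is row 1
  row: CsvRow;
  errors: string[];
}

const SOURCE_TYPES = ["RSS", "API", "Manual"];

function validateSourceRow(row: CsvRow): string[] {
  const errors: string[] = [];
  if ((row["name"] ?? "").trim() === "") errors.push("name is required");
  if (!/^https?:\/\/\S+$/i.test(row["website_url"] ?? "")) {
    errors.push("website_url must be an http(s) URL");
  }
  if (SOURCE_TYPES.indexOf(row["source_type"] ?? "") === -1) {
    errors.push("source_type must be RSS, API, or Manual");
  }
  // Trust score is optional here because the spec gives it a default.
  const rawTrust = (row["trust_score"] ?? "").trim();
  if (rawTrust !== "") {
    const trust = Number(rawTrust);
    if (!Number.isInteger(trust) || trust < 1 || trust > 10) {
      errors.push("trust_score must be an integer from 1 to 10");
    }
  }
  return errors;
}

function previewImport(rows: CsvRow[]): RowPreview[] {
  return rows.map((row, index) => ({
    rowNumber: index + 2,
    row,
    errors: validateSourceRow(row),
  }));
}
```

Rows with an empty `errors` array are the ones "Confirm Import" would create; the others can be serialized as the downloadable error report.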

### Flow 15: SQLite Backup
```
Admin → Backups
        ↓
Table: Filename, Size, Created At, Type (Manual / Auto)
        ↓
Actions: Create Backup Now, Download, Delete
        ↓
Auto-backup status: enabled/disabled, frequency, retention count
```
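
Backup naming and retention could be handled with two small helpers. The filename pattern `backup-YYYY-MM-DD-HHmmss.db` is taken from the directory layout in this spec; formatting the timestamp in UTC and applying the retention count to every file that matches the pattern (rather than to automatic backups only) are assumptions.

```typescript
const pad2 = (n: number): string => ("0" + n).slice(-2);

// Build a filename such as backup-2026-06-09-030405.db (UTC is an assumption).
function backupFilename(at: Date): string {
  const datePart = `${at.getUTCFullYear()}-${pad2(at.getUTCMonth() + 1)}-${pad2(at.getUTCDate())}`;
  const timePart = `${pad2(at.getUTCHours())}${pad2(at.getUTCMinutes())}${pad2(at.getUTCSeconds())}`;
  return `backup-${datePart}-${timePart}.db`;
}

const BACKUP_PATTERN = /^backup-\d{4}-\d{2}-\d{2}-\d{6}\.db$/;

// Return the oldest backups that exceed the retention count.
// The timestamp format sorts chronologically as plain text.
function selectBackupsToDelete(filenames: string[], retentionCount: number): string[] {
  const backups = filenames.filter((f) => BACKUP_PATTERN.test(f)).sort();
  const excess = Math.max(0, backups.length - retentionCount);
  return backups.slice(0, excess);
}
```

Filtering by pattern first means unrelated files in the backups directory are never selected for deletion.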

---

## 6. Complete Feature List

### Public Website

| # | Feature | Priority |
|---|---------|----------|
| P1 | One-page headline dashboard | P0 |
| P2 | Country filter (Global, India, US, UK, Japan) | P0 |
| P3 | Category filter (7 categories) | P0 |
| P4 | Sort options (Top Ranked, Latest, Most Covered, High Impact) | P0 |
| P5 | Time window filter (Today, 3 Days, Week, Month) | P1 |
| P6 | Confidence filter (All, High only) | P1 |
| P7 | Impact level filter (All, High only) | P1 |
| P8 | Story cards with full AI analysis display | P0 |
| P9 | Source attribution and count on each card | P0 |
| P10 | "Read original" external link | P0 |
| P11 | Relative timestamps ("18 minutes ago") | P0 |
| P12 | Empty state messaging | P1 |
| P13 | Error state with retry | P1 |
| P14 | Loading skeleton cards | P1 |
| P15 | Responsive design (mobile/tablet/desktop) | P0 |
| P16 | SEO meta tags (dynamic per filter combination) | P1 |
| P17 | Static informational and legal pages (About, Contact, Privacy, Terms, Disclaimer) | P0 |
| P18 | Footer with legal links | P0 |

### Admin Panel

| # | Feature | Priority |
|---|---------|----------|
| A1 | Admin login with rate limiting | P0 |
| A2 | Dashboard with metrics | P0 |
| A3 | Countries CRUD | P0 |
| A4 | Categories CRUD | P0 |
| A5 | Sources CRUD | P0 |
| A6 | Source mappings CRUD + bulk create | P0 |
| A7 | Raw articles list with filters | P0 |
| A8 | Story clusters list with filters | P0 |
| A9 | Story cluster detail view with rank breakdown | P0 |
| A10 | Review queue with approve/reject/hide/publish | P0 |
| A11 | AI analysis viewer and inline editor | P0 |
| A12 | Batch approve/reject | P1 |
| A13 | Fetch logs with detail view | P0 |
| A14 | Jobs list with status tracking | P0 |
| A15 | Ranking rules configuration | P0 |
| A16 | Settings management (all config in DB) | P0 |
| A17 | CSV export (sources, categories, mappings) | P1 |
| A18 | CSV import with preview and validation | P1 |
| A19 | SQLite backup (manual + scheduled) | P0 |
| A20 | Backup download and delete | P0 |
| A21 | Source health dashboard | P0 |
| A22 | Test source feed (verify RSS/API before activating) | P1 |
| A23 | Trigger individual source fetch | P0 |
| A24 | Trigger full fetch-all job | P0 |
| A25 | Trigger recluster job | P1 |
| A26 | Trigger AI analysis job (single cluster or batch) | P1 |
| A27 | Admin session timeout (24h) | P0 |

### Backend / System

| # | Feature | Priority |
|---|---------|----------|
| S1 | RSS feed fetcher with timeout and error handling | P0 |
| S2 | API source fetcher with provider abstraction | P0 |
| S3 | Article normalization (title, URL, content hash) | P0 |
| S4 | Exact duplicate detection (URL hash + title hash + content hash) | P0 |
| S5 | Near-duplicate detection (title similarity) | P0 |
| S6 | Story clustering (cosine similarity on TF-IDF vectors or embeddings) | P0 |
| S7 | Ranking engine with configurable scoring weights | P0 |
| S8 | AI provider abstraction layer | P0 |
| S9 | AI analysis generation (summary, why it matters, impacts, confidence) | P0 |
| S10 | AI analysis caching (do not regenerate unless cluster changes) | P0 |
| S11 | AI safety rules per category | P0 |
| S12 | Database-backed job system (job queue in SQLite) | P0 |
| S13 | Fetch scheduler via cPanel cron | P0 |
| S14 | Source health monitoring (consecutive failures, error rate) | P0 |
| S15 | SQLite backup rotation (local fs, outside public_html) | P0 |
| S16 | Logging system (structured logs to files outside public_html) | P0 |
| S17 | WAL mode for SQLite (better concurrency) | P0 |
| S18 | Write serialization for SQLite (one write at a time) | P0 |
| S19 | Request validation with Zod on all API routes | P0 |
| S20 | CSRF protection on admin actions | P0 |

---

## 7. SQLite-Based Architecture

### Why SQLite for MVP

SQLite is the correct choice for this MVP because:
- **Zero infrastructure**: No separate database server to install, configure, or maintain
- **Self-contained**: Single file, easy to back up and restore
- **Sufficient performance**: SQLite handles 100K+ rows easily; MVP volume is well within limits
- **cPanel compatible**: Works on GoDaddy VPS without additional services
- **Prisma support**: Full Prisma ORM support with SQLite provider
- **Simplifies deployment**: The Next.js app and the cron scripts open the same database file directly, so there is nothing else to deploy

### Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│                    GoDaddy VPS (WHM)                     │
│  ┌───────────────────────────────────────────────────┐  │
│  │              cPanel: headlinesift                   │  │
│  │                                                    │  │
│  │  ┌──────────────┐    ┌─────────────────────────┐  │  │
│  │  │  cPanel Cron  │───▶│  /home/headlinesift/     │  │  │
│  │  │  (scheduler)  │    │  headlinesift/           │  │  │
│  │  └──────────────┘    │  ├── scripts/             │  │  │
│  │                       │  │   ├── fetch-all.ts    │  │  │
│  │  ┌──────────────┐    │  │   ├── cluster.ts      │  │  │
│  │  │  Next.js App  │    │  │   ├── ai-analyze.ts   │  │  │
│  │  │  (PM2/Node)   │    │  │   ├── cleanup.ts     │  │  │
│  │  │               │    │  │   └── backup.ts      │  │  │
│  │  │  / (public)   │    │  ├── data/               │  │  │
│  │  │  /admin/*     │    │  │   └── headlinesift.db │  │  │
│  │  │  /api/*       │    │  ├── backups/            │  │  │
│  │  └──────┬────────┘    │  ├── logs/               │  │  │
│  │         │              │  ├── .env                │  │  │
│  │         │              │  └── (next.js app files) │  │  │
│  │         │              └─────────────────────────┘  │  │
│  │         │                                           │  │
│  │  ┌──────▼────────┐                                 │  │
│  │  │  public_html/  │                                 │  │
│  │  │  (static only) │                                 │  │
│  │  └───────────────┘                                 │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```

### Data Flow

```
External Sources (RSS/API)         Admin Browser            Public Browser
        │                               │                        │
        ▼                               ▼                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Next.js Server                              │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌───────────────┐  │
│  │ Fetcher  │  │Clusterer │  │  Ranker  │  │ AI Analyzer   │  │
│  │ Service  │  │ Service  │  │ Service  │  │   Service     │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └───────┬───────┘  │
│       │              │              │                │          │
│       └──────────────┼──────────────┼────────────────┘          │
│                      │              │                            │
│                      ▼              ▼                            │
│              ┌──────────────────────────┐                       │
│              │     Prisma ORM           │                       │
│              │  (SQLite WAL mode)       │                       │
│              └────────────┬─────────────┘                       │
│                           │                                      │
│                           ▼                                      │
│              ┌──────────────────────────┐                       │
│              │  /home/headlinesift/     │                       │
│              │  headlinesift/data/      │                       │
│              │  headlinesift.db         │                       │
│              └──────────────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘
```

### WAL Mode

SQLite will run in **WAL (Write-Ahead Logging)** mode. This allows:
- Concurrent reads while a write is in progress
- Better performance for read-heavy workloads (public page reads while admin writes)

WAL does not lift the single-writer limit: SQLite still serializes writes, so only one write transaction runs at a time.

These pragmas are not Prisma datasource options; they are issued as raw SQL when the database connection is opened (the `src/lib/db.ts` helper):
```
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
PRAGMA foreign_keys=ON;
```
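
One way to issue these pragmas from application code is sketched below. `RawSqlRunner` is a hypothetical minimal interface so the example stays self-contained; with Prisma it could be satisfied by a thin wrapper around `prisma.$queryRawUnsafe`, which is an assumption about how the project wires it up rather than something this spec states.

```typescript
// Hypothetical abstraction over "something that can run one raw SQL statement".
// Example wrapper (assumption): { run: (sql) => prisma.$queryRawUnsafe(sql) }
interface RawSqlRunner {
  run(sql: string): Promise<unknown>;
}

// The three pragmas listed in the spec, applied in order.
const SQLITE_PRAGMAS: readonly string[] = [
  "PRAGMA journal_mode=WAL;",
  "PRAGMA busy_timeout=5000;",
  "PRAGMA foreign_keys=ON;",
];

// Run each pragma sequentially and report which ones were applied.
async function applySqlitePragmas(db: RawSqlRunner): Promise<string[]> {
  const applied: string[] = [];
  for (const pragma of SQLITE_PRAGMAS) {
    await db.run(pragma);
    applied.push(pragma);
  }
  return applied;
}
```

Note that `journal_mode=WAL` is stored in the database file and persists across connections, whereas `busy_timeout` and `foreign_keys` are per-connection settings, so they need to be applied on every connection that the app or a cron script opens.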

---

## 8. WHM/cPanel Deployment Architecture

### Hosting Environment

| Component | Provider/Software |
|-----------|-------------------|
| VPS | GoDaddy VPS |
| Server Management | WHM (Web Host Manager) |
| Account Management | cPanel (headlinesift account) |
| Web Server | Apache (cPanel-managed) |
| Node.js | v22.22.3 (cPanel Node.js selector) |
| Process Manager | PM2 (installed via npm in user space) |
| Database | SQLite (file-based, no server needed) |
| Scheduling | cPanel Cron Jobs |
| Email | cPanel email (optional, for alerts) |

### Application Serving Strategy

**Option A: PM2 + Reverse Proxy (Recommended)**

```
Apache (cPanel) ──reverse proxy──▶ PM2 running Next.js on localhost:3000
```

1. Create a cPanel Reverse Proxy from the domain to `localhost:3000`
2. PM2 runs `node server.js` as a daemon (the Next.js build output; Next.js generates this file when `output: 'standalone'` is set in `next.config.ts`)
3. PM2 auto-restarts on crash, auto-starts on server reboot

**Configuration:**
- PM2 managed via `ecosystem.config.js` in project root
- PM2 log output directed to `/home/headlinesift/headlinesift/logs/`
- Startup script: `pm2 startup` + `pm2 save`
- cPanel Reverse Proxy configured via `.htaccess` or cPanel UI

**Option B: cPanel Node.js Application Manager (Fallback)**

cPanel includes a Node.js application manager that can:
- Register the Next.js application
- Set environment variables
- Manage start/stop/restart

This is simpler but less flexible than PM2.

### Directory Permissions

```
/home/headlinesift/headlinesift/       → 755 (owner: headlinesift, group: headlinesift)
/home/headlinesift/headlinesift/data/  → 755
/home/headlinesift/headlinesift/data/headlinesift.db → 644
/home/headlinesift/headlinesift/backups/ → 755
/home/headlinesift/headlinesift/logs/    → 755
/home/headlinesift/headlinesift/.env     → 600 (owner read/write only)
/home/headlinesift/public_html/          → 750 (cPanel default, required for Apache)
```

### Why the App Lives Outside public_html

The entire Next.js application (source, node_modules, SQLite DB, .env, logs, backups) lives in `/home/headlinesift/headlinesift/`. Only public-facing static assets (if any) are served through public_html. This is a critical security boundary:
- `.env` with API keys and admin password hash is never web-accessible
- `headlinesift.db` is never web-accessible
- Logs and backups are never web-accessible
- Source code is never web-accessible

---

## 9. Directory Structure

```
/home/headlinesift/
├── public_html/                    # Apache document root
│   ├── .htaccess                   # Apache config (reverse proxy rules)
│   ├── index.php                   # Placeholder or proxy config
│   └── ...                         # Only what Apache needs to serve
│
├── headlinesift/                   # Application root (OUTSIDE public_html)
│   ├── .env                        # Environment variables (chmod 600)
│   ├── .env.example                # Template for .env
│   ├── .gitignore
│   ├── package.json
│   ├── tsconfig.json
│   ├── next.config.ts
│   ├── tailwind.config.ts
│   ├── postcss.config.mjs
│   ├── ecosystem.config.js         # PM2 configuration
│   │
│   ├── prisma/
│   │   ├── schema.prisma           # SQLite schema
│   │   └── migrations/             # Prisma migration files
│   │
│   ├── data/                       # SQLite database (gitignored)
│   │   └── headlinesift.db
│   │
│   ├── backups/                    # SQLite backups (gitignored)
│   │   └── backup-YYYY-MM-DD-HHmmss.db
│   │
│   ├── logs/                       # Application logs (gitignored)
│   │   ├── app.log                 # Next.js app logs
│   │   ├── fetch.log               # Fetch job logs
│   │   ├── cluster.log             # Clustering job logs
│   │   ├── ai.log                  # AI analysis job logs
│   │   └── error.log               # Error logs
│   │
│   ├── src/
│   │   ├── app/                    # Next.js App Router pages
│   │   │   ├── layout.tsx          # Root layout
│   │   │   ├── page.tsx            # Public homepage (headline dashboard)
│   │   │   ├── loading.tsx         # Loading skeleton
│   │   │   ├── error.tsx           # Error boundary
│   │   │   ├── not-found.tsx       # 404 page
│   │   │   │
│   │   │   ├── about/
│   │   │   │   └── page.tsx        # About page
│   │   │   ├── contact/
│   │   │   │   └── page.tsx        # Contact page
│   │   │   ├── privacy/
│   │   │   │   └── page.tsx        # Privacy policy
│   │   │   ├── terms/
│   │   │   │   └── page.tsx        # Terms of use
│   │   │   ├── disclaimer/
│   │   │   │   └── page.tsx        # Legal disclaimer
│   │   │   │
│   │   │   ├── admin/
│   │   │   │   ├── layout.tsx      # Admin layout (auth check, sidebar)
│   │   │   │   ├── page.tsx        # Redirect to dashboard
│   │   │   │   ├── login/
│   │   │   │   │   └── page.tsx    # Admin login page
│   │   │   │   ├── dashboard/
│   │   │   │   │   └── page.tsx    # Admin dashboard
│   │   │   │   ├── countries/
│   │   │   │   │   └── page.tsx    # Countries CRUD
│   │   │   │   ├── categories/
│   │   │   │   │   └── page.tsx    # Categories CRUD
│   │   │   │   ├── sources/
│   │   │   │   │   └── page.tsx    # Sources CRUD
│   │   │   │   ├── mappings/
│   │   │   │   │   └── page.tsx    # Source mappings
│   │   │   │   ├── articles/
│   │   │   │   │   └── page.tsx    # Raw articles viewer
│   │   │   │   ├── clusters/
│   │   │   │   │   ├── page.tsx    # Story clusters list
│   │   │   │   │   └── [id]/
│   │   │   │   │       └── page.tsx # Cluster detail
│   │   │   │   ├── review/
│   │   │   │   │   └── page.tsx    # Review queue
│   │   │   │   ├── logs/
│   │   │   │   │   └── page.tsx    # Fetch logs
│   │   │   │   ├── jobs/
│   │   │   │   │   └── page.tsx    # Jobs list
│   │   │   │   ├── ranking/
│   │   │   │   │   └── page.tsx    # Ranking rules
│   │   │   │   ├── settings/
│   │   │   │   │   └── page.tsx    # Settings
│   │   │   │   ├── import-export/
│   │   │   │   │   └── page.tsx    # CSV import/export
│   │   │   │   └── backups/
│   │   │   │       └── page.tsx    # SQLite backup management
│   │   │   │
│   │   │   └── api/                # API routes
│   │   │       ├── stories/        # Public: get published stories
│   │   │       │   └── route.ts
│   │   │       ├── admin/
│   │   │       │   ├── auth/       # Login/logout/session
│   │   │       │   │   └── route.ts
│   │   │       │   ├── countries/  # Countries CRUD API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── categories/ # Categories CRUD API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── sources/    # Sources CRUD API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── mappings/   # Source mappings API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── articles/   # Articles API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── clusters/   # Clusters API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── review/     # Review actions API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── fetch/      # Trigger fetch API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── jobs/       # Jobs API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── ranking/    # Ranking rules API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── settings/   # Settings API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── import/     # CSV import API
│   │   │       │   │   └── route.ts
│   │   │       │   ├── export/     # CSV export API
│   │   │       │   │   └── route.ts
│   │   │       │   └── backups/    # Backup API
│   │   │       │       └── route.ts
│   │   │       └── health/         # Health check endpoint
│   │   │           └── route.ts
│   │   │
│   │   ├── components/
│   │   │   ├── public/             # Public-facing components
│   │   │   │   ├── Header.tsx       # Site header with logo
│   │   │   │   ├── Footer.tsx       # Site footer with legal links
│   │   │   │   ├── FilterBar.tsx    # Country, category, sort, time filters
│   │   │   │   ├── StoryCard.tsx    # Individual story card
│   │   │   │   ├── StoryList.tsx    # List of story cards
│   │   │   │   ├── ImpactBadge.tsx  # Impact level badge (High/Med/Low)
│   │   │   │   ├── ConfidenceBadge.tsx # Confidence badge
│   │   │   │   ├── SourceList.tsx   # Source attribution list
│   │   │   │   ├── EmptyState.tsx   # No results state
│   │   │   │   ├── ErrorState.tsx   # Error state with retry
│   │   │   │   └── LoadingSkeleton.tsx # Loading placeholder
│   │   │   │
│   │   │   ├── admin/              # Admin components
│   │   │   │   ├── AdminLayout.tsx # Admin shell with sidebar
│   │   │   │   ├── AdminSidebar.tsx # Navigation sidebar
│   │   │   │   ├── AdminHeader.tsx # Top bar with user info
│   │   │   │   ├── MetricCard.tsx  # Dashboard metric card
│   │   │   │   ├── DataTable.tsx   # Reusable sortable/filterable table
│   │   │   │   ├── ConfirmDialog.tsx # Confirmation modal
│   │   │   │   ├── StatusBadge.tsx # Status indicator badge
│   │   │   │   ├── RankBreakdown.tsx # Rank score breakdown visualization
│   │   │   │   └── JobStatusBadge.tsx # Job status indicator
│   │   │   │
│   │   │   └── ui/                 # Shared UI primitives
│   │   │       ├── Button.tsx
│   │   │       ├── Input.tsx
│   │   │       ├── Select.tsx
│   │   │       ├── Badge.tsx
│   │   │       ├── Card.tsx
│   │   │       ├── Modal.tsx
│   │   │       ├── Toast.tsx
│   │   │       ├── Spinner.tsx
│   │   │       ├── Skeleton.tsx
│   │   │       └── Pagination.tsx
│   │   │
│   │   ├── lib/
│   │   │   ├── prisma.ts           # Prisma client singleton
│   │   │   ├── auth.ts             # Admin authentication helpers
│   │   │   ├── session.ts          # Session management (cookie-based)
│   │   │   ├── validation.ts       # Zod schemas (shared)
│   │   │   ├── utils.ts            # General utility functions
│   │   │   ├── constants.ts        # App constants
│   │   │   ├── db.ts               # Database helpers (WAL pragma, backup)
│   │   │   └── logger.ts           # Structured logging utility
│   │   │
│   │   ├── services/               # Business logic services
│   │   │   ├── fetcher.ts          # RSS/API fetch service
│   │   │   ├── normalizer.ts       # Article normalization
│   │   │   ├── deduplicator.ts     # Exact + near duplicate detection
│   │   │   ├── clusterer.ts        # Story clustering
│   │   │   ├── ranker.ts           # Ranking engine
│   │   │   ├── ai-analyzer.ts      # AI analysis service
│   │   │   ├── job-runner.ts       # Job execution engine
│   │   │   ├── backup.ts           # SQLite backup service
│   │   │   │   └── csv.ts              # CSV import/export service
│   │   │
│   │   ├── providers/              # External service providers
│   │   │   ├── ai/
│   │   │   │   ├── index.ts        # AI provider interface
│   │   │   │   ├── openai.ts       # OpenAI provider
│   │   │   │   ├── anthropic.ts    # Anthropic provider
│   │   │   │   └── google.ts       # Google provider
│   │   │   │
│   │   │   └── news/
│   │   │       ├── index.ts        # News source interface
│   │   │       ├── rss.ts          # RSS feed parser
│   │   │       └── api.ts          # REST API news source
│   │   │
│   │   ├── middleware.ts           # Next.js middleware (auth, rate limit)
│   │   │
│   │   └── types/                  # TypeScript type definitions
│   │       ├── index.ts            # Shared types
│   │       ├── story.ts            # Story/cluster types
│   │       ├── source.ts           # Source types
│   │       └── job.ts              # Job types
│   │
│   └── scripts/                    # CLI scripts (run via cron or manually)
│       ├── fetch-all.ts            # Fetch all active sources
│       ├── fetch-source.ts         # Fetch a single source by ID
│       ├── cluster-stories.ts      # Run clustering on unclustered articles
│       ├── rank-stories.ts         # Re-rank all clusters
│       ├── ai-analyze.ts           # Generate AI analysis for unanalyzed clusters
│       ├── cleanup.ts              # Clean old articles/logs, prune database
│       ├── backup.ts               # Create SQLite backup
│       ├── health-check.ts         # Check source health, log results
│       └── seed.ts                 # Seed database with 100 initial sources
```

---

## 10. Database Entities

### Entity Relationship Summary

```
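countries      1──M source_mappings M──1 sources
categories     1──M source_mappings

sources        1──M raw_articles
countries      1──M raw_articles        (via country_id, optional)
categories     1──M raw_articles        (via category_id, optional)
raw_articles   1──M raw_articles        (via duplicate_of_id, self-reference)

raw_articles   1──M story_articles M──1 story_clusters
sources        1──M story_articles
countries      1──M story_clusters      (via country_id, optional)
categories     1──M story_clusters      (via category_id, optional)
story_clusters 1──1 ai_story_analysis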

fetch_logs M──1 sources
jobs (standalone)
settings (standalone, key-value)
sessions (standalone, admin sessions)
```

### Prisma Schema — All Entities

#### Country

```prisma
model Country {
  id             Int              @id @default(autoincrement())
  name           String           @unique        // "India", "United States"
  code           String           @unique        // "IN", "US", "GB", "JP", "GLOBAL"
  region         String                          // "Asia", "North America", "Europe"
  defaultLanguage String          @default("en") // "en", "ja", "hi"
  status         String           @default("active") // active | inactive
  displayOrder   Int              @default(0)
  isGlobalOption Boolean          @default(false) // true for Global pseudo-country
  createdAt      DateTime         @default(now())
  updatedAt      DateTime         @updatedAt

  sourceMappings SourceMapping[]
  rawArticles    RawArticle[]
  storyClusters  StoryCluster[]   // back-relation required by StoryCluster.country

  @@map("countries")
}
```

#### Category

```prisma
model Category {
  id              Int              @id @default(autoincrement())
  name            String           @unique        // "Technology", "Finance & Stock Market"
  slug            String           @unique        // "technology", "finance"
  level           String           @default("country") // global | country | both
  description     String?
  status          String           @default("active")  // active | inactive
  displayOrder    Int              @default(0)
  aiSafetyLevel   String           @default("medium")  // low | medium | high | critical
  maxPublicStories Int             @default(50)
  autoPublish     Boolean          @default(false) // if true, skip review for this category
  createdAt       DateTime         @default(now())
  updatedAt       DateTime         @updatedAt

  sourceMappings  SourceMapping[]
  rawArticles     RawArticle[]
  storyClusters   StoryCluster[]

  @@map("categories")
}
```

#### Source

```prisma
model Source {
  id                    Int              @id @default(autoincrement())
  name                  String           @unique        // "TechCrunch", "BBC News"
  websiteUrl            String                          // "https://techcrunch.com"
  sourceType            String                          // rss | api | manual
  feedEndpoint          String?                         // RSS feed URL
  apiEndpoint           String?                         // API endpoint URL
  apiKeyRef             String?                         // Reference to .env key name (never the key itself)
  language              String           @default("en")
  trustScore            Int              @default(5)    // 1–10
  reliabilityScore      Float            @default(5.0)  // computed from fetch history
  status                String           @default("inactive") // active | inactive | paused | blocked
  fetchFrequencyMinutes Int              @default(30)
  lastFetchedAt         DateTime?
  lastFetchStatus       String?                          // success | partial | failed
  consecutiveFailures   Int              @default(0)
  totalFetches          Int              @default(0)
  totalArticlesFetched  Int              @default(0)
  duplicateRate         Float            @default(0.0)
  errorRate             Float            @default(0.0)
  usageNotes            String?
  createdAt             DateTime         @default(now())
  updatedAt             DateTime         @updatedAt

  sourceMappings        SourceMapping[]
  rawArticles           RawArticle[]
  fetchLogs             FetchLog[]
  storyArticles         StoryArticle[]

  @@map("sources")
}
```

#### SourceMapping

```prisma
model SourceMapping {
  id         Int      @id @default(autoincrement())
  sourceId   Int
  countryId  Int
  categoryId Int
  priority   Int      @default(0)
  status     String   @default("active") // active | inactive
  createdAt  DateTime @default(now())
  updatedAt  DateTime @updatedAt

  source   Source   @relation(fields: [sourceId], references: [id], onDelete: Cascade)
  country  Country  @relation(fields: [countryId], references: [id], onDelete: Cascade)
  category Category @relation(fields: [categoryId], references: [id], onDelete: Cascade)

  @@unique([sourceId, countryId, categoryId])
  @@map("source_mappings")
}
```

#### RawArticle

```prisma
model RawArticle {
  id              Int      @id @default(autoincrement())
  sourceId        Int
  countryId       Int?
  categoryId      Int?
  title           String
  originalUrl     String
  author          String?
  publishedAt     DateTime?
  fetchedAt       DateTime  @default(now())
  rawSnippet      String?                             // Original text snippet (not full article)
  rawContent      String?                             // Full fetched content (headline + snippet)
  language        String    @default("en")
  contentHash     String                              // SHA-256 hash of (normalized title + snippet)
  urlHash         String                              // SHA-256 hash of normalized URL
  titleHash       String                              // SHA-256 hash of normalized title
  status          String    @default("new")           // new | duplicate | clustered | ignored
  duplicateOfId   Int?                                // If duplicate, reference to the original article
  duplicateReason String?                             // url | title | content (why it was marked duplicate)
  createdAt       DateTime  @default(now())
  updatedAt       DateTime  @updatedAt

  source         Source          @relation(fields: [sourceId], references: [id], onDelete: Cascade)
  country        Country?        @relation(fields: [countryId], references: [id])
  category       Category?       @relation(fields: [categoryId], references: [id])
  storyArticles  StoryArticle[]
  duplicateOf    RawArticle?     @relation("DuplicateRef", fields: [duplicateOfId], references: [id])
  duplicates     RawArticle[]    @relation("DuplicateRef")

  @@index([urlHash])
  @@index([titleHash])
  @@index([contentHash])
  @@index([status])
  @@index([sourceId])
  @@index([fetchedAt])
  @@index([publishedAt])
  @@map("raw_articles")
}
```

#### StoryCluster

```prisma
model StoryCluster {
  id                  Int      @id @default(autoincrement())
  canonicalTitle      String                            // Best representative title
  slug                String                            // URL-friendly slug
  categoryId          Int?
  countryId           Int?
  firstSeenAt         DateTime @default(now())
  lastSeenAt          DateTime @default(now())
  sourceCount         Int      @default(0)
  trustedSourceCount  Int      @default(0)
  officialSourcePresent Boolean @default(false)
  rankScore           Float    @default(0.0)
  rankBreakdown       String?                          // JSON string of scoring breakdown
  impactScore         Float    @default(0.0)
  confidenceScore     Float    @default(0.0)
  status              String   @default("pending_review") // pending_review | approved | published | rejected | hidden
  publishedAt         DateTime?
  createdAt           DateTime @default(now())
  updatedAt           DateTime @updatedAt

  category       Category?          @relation(fields: [categoryId], references: [id])
  country        Country?           @relation(fields: [countryId], references: [id])
  storyArticles  StoryArticle[]
  aiAnalysis     AIStoryAnalysis?

  @@index([status])
  @@index([categoryId])
  @@index([countryId])
  @@index([rankScore])
  @@index([firstSeenAt])
  @@map("story_clusters")
}
```

#### StoryArticle

```prisma
model StoryArticle {
  id              Int          @id @default(autoincrement())
  storyClusterId  Int
  rawArticleId    Int
  sourceId        Int
  similarityScore Float        @default(0.0) // How similar this article is to cluster centroid
  isCanonical     Boolean      @default(false) // Is this the canonical article for the cluster?
  createdAt       DateTime     @default(now())

  storyCluster StoryCluster @relation(fields: [storyClusterId], references: [id], onDelete: Cascade)
  rawArticle   RawArticle   @relation(fields: [rawArticleId], references: [id], onDelete: Cascade)
  source       Source       @relation(fields: [sourceId], references: [id])

  @@unique([storyClusterId, rawArticleId])
  @@map("story_articles")
}
```

#### AIStoryAnalysis

```prisma
model AIStoryAnalysis {
  id                Int      @id @default(autoincrement())
  storyClusterId    Int      @unique         // One-to-one with cluster
  summary           String?                  // AI-generated summary
  whyItMatters      String?                  // Why this story matters
  positiveImpact    String?                  // Positive impact analysis
  negativeImpact    String?                  // Negative impact analysis
  affectedGroups    String?                  // JSON array of affected groups
  impactLevel       String?                  // low | medium | high | critical
  confidenceLevel   String?                  // low | medium | high
  confidenceReason  String?                  // Explanation of confidence assessment
  neutralityCheck   String?                  // Assessment of neutrality
  riskWarning       String?                  // Risk/caution warning
  displayHeadline   String?                  // AI-suggested display headline
  modelName         String?                  // Which AI model generated this
  inputSourceCount  Int      @default(0)     // How many sources fed into analysis
  generatedAt       DateTime @default(now())
  humanEdited       Boolean  @default(false) // Has an admin manually edited this?
  createdAt         DateTime @default(now())
  updatedAt         DateTime @updatedAt

  storyCluster StoryCluster @relation(fields: [storyClusterId], references: [id], onDelete: Cascade)

  @@map("ai_story_analysis")
}
```

#### FetchLog

```prisma
model FetchLog {
  id               Int      @id @default(autoincrement())
  sourceId         Int
  startedAt        DateTime @default(now())
  endedAt          DateTime?
  status           String   @default("running") // running | success | partial | failed
  articlesFound    Int      @default(0)
  articlesSaved    Int      @default(0)
  duplicatesFound  Int      @default(0)
  errorsFound      Int      @default(0)
  errorMessage     String?                      // Last error message
  errorTrace       String?                      // Stack trace for debugging
  httpStatus       Int?                         // HTTP status code from source
  responseTimeMs   Int?                         // Response time in milliseconds
  createdAt        DateTime @default(now())

  source Source @relation(fields: [sourceId], references: [id], onDelete: Cascade)

  @@index([sourceId])
  @@index([startedAt])
  @@index([status])
  @@map("fetch_logs")
}
```

#### Job

```prisma
model Job {
  id              Int      @id @default(autoincrement())
  type            String                        // fetch_all | fetch_source | cluster | rank | ai_analyze | ai_analyze_single | cleanup | backup
  status          String   @default("pending")  // pending | running | completed | failed
  targetId        Int?                          // Optional: source ID, cluster ID, etc.
  progress        Int      @default(0)          // 0–100 percentage
  progressMessage String?                       // "Processing source 15/100..."
  totalItems      Int      @default(0)
  processedItems  Int      @default(0)
  resultSummary   String?                       // Human-readable result
  errorMessage    String?
  startedAt       DateTime?
  completedAt     DateTime?
  createdAt       DateTime @default(now())
  updatedAt       DateTime @updatedAt

  @@index([type])
  @@index([status])
  @@index([createdAt])
  @@map("jobs")
}
```

#### Setting

```prisma
model Setting {
  id        Int      @id @default(autoincrement())
  key       String   @unique            // e.g., "admin_password_hash", "site_name"
  value     String                      // String value (JSON for complex settings)
  category  String   @default("general") // general | admin | ai | fetching | display | moderation
  label     String?                     // Human-readable label for UI
  updatedAt DateTime @updatedAt
  createdAt DateTime @default(now())

  @@map("settings")
}
```

#### Session

```prisma
model Session {
  id        Int      @id @default(autoincrement())
  token     String   @unique            // Random session token (SHA-256)
  expiresAt DateTime                    // Session expiry (24h from login)
  createdAt DateTime @default(now())

  @@index([expiresAt])
  @@map("sessions")
}
```

---

## 11. Database-Backed Job System

### Concept

Instead of Redis + BullMQ, we use a simple `jobs` table in SQLite. Scripts (triggered by cron or admin) insert a job row, execute work, and update the row. The Next.js admin UI reads job status from the same table.

### Job Types

| Job Type | Description | Trigger |
|----------|-------------|---------|
| `fetch_all` | Fetch all active sources | Cron (every 30 min) or admin manual trigger |
| `fetch_source` | Fetch a single source | Admin "Fetch Now" button |
| `cluster` | Run clustering on unclustered articles | Cron (every 30 min, after fetch) |
| `rank` | Re-rank all story clusters | Cron (every 30 min, after cluster) |
| `ai_analyze` | Generate AI analysis for unanalyzed clusters | Cron (every 30 min, after rank) |
| `ai_analyze_single` | Generate AI analysis for one cluster | Admin manual trigger |
| `cleanup` | Prune old articles, logs, and clusters | Cron (daily) |
| `backup` | Create SQLite backup | Cron (daily) or admin manual trigger |

### Job Lifecycle

```
pending → running → completed
                 ↘ failed
```

### Job Execution Model

```
┌────────────────────────────────────────────────────────────┐
│                    cPanel Cron Job                          │
│  /opt/cpanel/ea-nodejs22/bin/node --import tsx             │
│    scripts/fetch-all.ts                                    │
└────────────────────────┬───────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────────┐
│  scripts/fetch-all.ts                                      │
│                                                            │
│  1. INSERT INTO jobs (type, status) VALUES ('fetch_all',   │
│     'pending')                                             │
│  2. Read job ID                                            │
│  3. UPDATE jobs SET status='running', started_at=NOW()     │
│     WHERE id=$jobId                                        │
│  4. For each active source:                                │
│     - Fetch RSS/API                                        │
│     - Normalize articles                                   │
│     - Detect duplicates                                    │
│     - Save new articles                                    │
│     - Log fetch result                                     │
│     - UPDATE jobs SET progress=N%, processed_items++       │
│  5. On success: UPDATE jobs SET status='completed',        │
│     completed_at=NOW(), result_summary='...'               │
│  6. On failure: UPDATE jobs SET status='failed',           │
│     error_message='...'                                    │
└────────────────────────────────────────────────────────────┘
```

### Concurrency Protection

Only ONE instance of a given job type should run at a time. Before starting, the script checks:

```sql
SELECT COUNT(*) FROM jobs
WHERE type = $type AND status = 'running'
```

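If count > 0, the script exits (another instance is already running). The check and the transition of the new job to `running` should happen in one transaction; otherwise two scripts started at the same moment can both read a count of 0 and proceed.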
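The decision rule behind this check can be sketched as a pure function over job rows that have already been loaded. This is a minimal sketch, not the project's implementation: `JobRow`, `canStartJob`, and `STALE_AFTER_MS` are hypothetical names, and ignoring running rows older than 2 hours is an assumption borrowed from the stale-job rule in Job Retention. It covers only the decision, not how rows are loaded or locked.

```typescript
// Hypothetical shape: only the fields the check needs.
interface JobRow {
  type: string
  status: "pending" | "running" | "completed" | "failed"
  startedAt: Date
}

// Assumption: reuse the 2-hour stale-job cutoff from Job Retention.
const STALE_AFTER_MS = 2 * 60 * 60 * 1000

// Returns true when no live (non-stale) job of the same type is running.
function canStartJob(jobs: JobRow[], type: string, now: Date): boolean {
  const blocking = jobs.filter((job) => {
    if (job.type !== type || job.status !== "running") return false
    // A row that has been running longer than the cutoff is treated as crashed.
    return now.getTime() - job.startedAt.getTime() <= STALE_AFTER_MS
  })
  return blocking.length === 0
}
```

Without the stale cutoff, a script that crashed while `running` would block every later run of the same type until the daily cleanup marks it as failed.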

### Job Retention

- Completed jobs: kept for 7 days, then deleted by cleanup script
- Failed jobs: kept for 30 days for debugging
- Running jobs older than 2 hours: marked as failed (stale job detection)

---

## 12. Local Script Strategy

### Philosophy

Scripts are TypeScript files in the `scripts/` directory. They are executed directly via `tsx` (TypeScript Execute) or compiled to JS first. Each script:
- Is self-contained and idempotent
- Uses the shared Prisma client and services
- Logs to both console (captured by cron email) and log files
- Creates/updates a Job row for tracking
- Has clear error handling and exit codes

### Script Architecture

```
Each script follows this pattern:

1. Parse CLI arguments (optional: --source-id, --cluster-id, etc.)
2. Check no conflicting job is already running
3. Create job record
4. Execute work in try/catch
5. Update job record with result
6. Log summary
7. Exit with code 0 (success) or 1 (failure)
```
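Steps 3 to 7 of this pattern could be shared by all scripts through one wrapper. The sketch below is illustrative only: `JobStore` and `runJob` are hypothetical names, and the store is an abstract interface so the example does not depend on Prisma. A real script would back it with the `jobs` table.

```typescript
// Hypothetical storage interface; a real script would back this with the jobs table.
interface JobStore {
  create(type: string): number
  update(id: number, fields: Record<string, unknown>): void
}

// Creates the job, runs the work, records the outcome, and returns an exit code.
async function runJob(
  store: JobStore,
  type: string,
  work: () => Promise<string>,
): Promise<number> {
  const id = store.create(type)
  store.update(id, { status: "running", startedAt: new Date() })
  try {
    const resultSummary = await work()
    store.update(id, { status: "completed", completedAt: new Date(), resultSummary })
    console.log(`[${type}] ${resultSummary}`)
    return 0
  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : String(error)
    store.update(id, { status: "failed", completedAt: new Date(), errorMessage })
    console.error(`[${type}] failed: ${errorMessage}`)
    return 1
  }
}
```

A script would pass the returned code to `process.exit`, so cron and the admin UI see the same outcome.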

### Script Details

#### fetch-all.ts
```
Purpose: Fetch all active sources
Frequency: Every 30 minutes via cron
Logic:
  - Create job (type: fetch_all)
  - Query all active sources
  - Sort by last_fetched_at (oldest first)
  - For each source:
    - Check if source is due for fetch based on fetch_frequency_minutes
    - Fetch RSS or API
    - Parse and normalize articles
    - Run deduplication
    - Save new articles
    - Create fetch_log entry
    - Update source last_fetched_at, consecutive_failures, health metrics
    - Update job progress
  - Complete job with summary
```
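The "due for fetch" check and the oldest-first ordering can be sketched as a pure function. `SourceSchedule` and `selectDueSources` are hypothetical names, and treating a never-fetched source as always due is an assumption.

```typescript
// Hypothetical subset of the Source model used for scheduling.
interface SourceSchedule {
  id: number
  fetchFrequencyMinutes: number
  lastFetchedAt: Date | null
}

// Returns the sources that are due, oldest fetch first; never-fetched sources come first.
function selectDueSources(sources: SourceSchedule[], now: Date): SourceSchedule[] {
  const fetchedAtMs = (source: SourceSchedule): number =>
    source.lastFetchedAt ? source.lastFetchedAt.getTime() : 0
  return sources
    .filter((source) => {
      const elapsedMs = now.getTime() - fetchedAtMs(source)
      return elapsedMs >= source.fetchFrequencyMinutes * 60 * 1000
    })
    .sort((a, b) => fetchedAtMs(a) - fetchedAtMs(b))
}
```

One design point to watch: if `last_fetched_at` is written a few seconds after the run starts, a source with a 30-minute frequency can measure slightly under 30 minutes at the next run and be skipped. A small tolerance on the comparison may be needed.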

#### fetch-source.ts
```
Purpose: Fetch a single source (admin triggered)
Usage: tsx scripts/fetch-source.ts --source-id=5
Logic:
  - Create job (type: fetch_source, targetId=5)
  - Fetch, normalize, dedup, save
  - Update source metrics
  - Complete job
```

#### cluster-stories.ts
```
Purpose: Cluster new (not yet clustered) articles
Frequency: Every 30 minutes via cron (3 min offset from fetch)
Logic:
  - Create job (type: cluster)
  - Query raw_articles with status='new' (not yet clustered)
  - Group by category and country
  - For each group, run clustering service:
    - Generate TF-IDF or embedding vectors for titles
    - Compute cosine similarity between all pairs
    - Group articles above similarity threshold into clusters
    - Create new StoryCluster rows
    - Create StoryArticle rows linking articles to clusters
    - Update raw_article status to 'clustered'
    - For singletons (no similar articles), create single-article clusters
  - Complete job with summary
```

#### rank-stories.ts
```
Purpose: Re-rank all active clusters
Frequency: Every 30 minutes via cron (6 min offset from fetch)
Logic:
  - Create job (type: rank)
  - Query all clusters with status != 'rejected'
  - For each cluster:
    - Compute freshness score
    - Sum source trust scores
    - Apply official source boost
    - Apply category-specific weights
    - Compute final rank_score
    - Store rank_breakdown as JSON
    - Update cluster
  - Sort and mark top N per category/country
  - Complete job
```

#### ai-analyze.ts
```
Purpose: Generate AI analysis for unanalyzed top clusters
Frequency: Every 30 minutes via cron (9 min offset from fetch)
Logic:
  - Create job (type: ai_analyze)
  - Query clusters with no AI analysis, sorted by rank_score DESC
  - Limit to top N per run (e.g., 20) to control AI API costs
  - For each cluster:
    - Gather member article titles and snippets
    - Build AI prompt with content + safety rules for category
    - Call AI provider
    - Parse and validate response against Zod schema
    - Create AIStoryAnalysis row
    - Update job progress
  - Complete job
```

#### cleanup.ts
```
Purpose: Database maintenance
Frequency: Daily at 3:00 AM via cron
Logic:
  - Create job (type: cleanup)
  - Delete raw_articles with status='duplicate' older than 7 days
  - Delete raw_articles with status='ignored' older than 30 days
  - Delete jobs with status='completed' older than 7 days
  - Delete jobs with status='failed' older than 30 days
  - Delete fetch_logs older than 30 days
  - Mark jobs with status='running' that started more than 2 hours ago as failed
  - Run VACUUM on SQLite (reclaim space) — offline, when app is idle
  - Complete job with summary
```

#### backup.ts
```
Purpose: Create SQLite backup
Frequency: Daily at 2:00 AM via cron
Logic:
  - Create job (type: backup)
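  - Use the SQLite online backup API (better-sqlite3 db.backup()) or VACUUM INTO via a Prisma raw query
    (the .backup dot-command itself is only available in the sqlite3 CLI)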
  - Copy database to backups/backup-YYYY-MM-DD-HHmmss.db
  - Keep last 7 daily backups, delete older ones
  - Log backup size
  - Complete job
```
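The retention rule ("keep last 7") can be sketched as a pure function over the file names in `backups/`. `selectBackupsToDelete` is a hypothetical helper. It assumes every backup follows the `backup-YYYY-MM-DD-HHmmss.db` pattern above, so that sorting names alphabetically also sorts them by time, and it counts manually triggered backups toward the limit.

```typescript
// Backup names follow backup-YYYY-MM-DD-HHmmss.db, so lexical order equals chronological order.
const BACKUP_NAME = /^backup-\d{4}-\d{2}-\d{2}-\d{6}\.db$/

// Returns the backup files to delete, keeping the newest `keep` files.
function selectBackupsToDelete(fileNames: string[], keep = 7): string[] {
  // Ignore anything in the directory that is not a backup file.
  const backups = fileNames.filter((name) => BACKUP_NAME.test(name)).sort()
  return backups.slice(0, Math.max(0, backups.length - keep))
}
```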

#### health-check.ts
```
Purpose: Check source health, detect broken feeds
Frequency: Every 6 hours via cron
Logic:
  - Query all active sources
  - Check consecutive_failures > threshold → mark status='paused', flag for admin
  - Check error_rate > threshold → lower reliability_score
  - Log health report to logs/health.log
  - No job record needed (lightweight)
```

#### seed.ts
```
Purpose: Seed database with initial 100 sources, countries, categories
Usage: tsx scripts/seed.ts (one-time, after first deployment)
Logic:
  - Create 5 countries (Global, India, US, UK, Japan)
  - Create 7 categories with appropriate levels
  - Create 100 sources (from MVP seed list) with feed_endpoint=null
  - Create source mappings linking sources to countries/categories
  - All sources initially inactive (admin must verify and activate)
  - Create default settings (admin password hash, site config)
  - Idempotent: skip if data already exists
```

---

## 13. cPanel Cron Strategy

### Cron Job Configuration (in cPanel Cron Job Manager)

All scripts run from the project root. Output is captured to log files.

```
# Fetch all active sources — every 30 minutes at :05 and :35
5,35 * * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/fetch-all.ts >> logs/fetch.log 2>&1

# Cluster new articles — every 30 minutes at :08 and :38 (3 min offset)
8,38 * * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/cluster-stories.ts >> logs/cluster.log 2>&1

# Rank clusters — every 30 minutes at :11 and :41 (6 min offset)
11,41 * * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/rank-stories.ts >> logs/rank.log 2>&1

# AI analysis — every 30 minutes at :14 and :44 (9 min offset)
14,44 * * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/ai-analyze.ts >> logs/ai.log 2>&1

# SQLite backup — daily at 2:00 AM
0 2 * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/backup.ts >> logs/backup.log 2>&1

# Cleanup — daily at 3:00 AM
0 3 * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/cleanup.ts >> logs/cleanup.log 2>&1

# Source health check — every 6 hours
0 */6 * * * cd /home/headlinesift/headlinesift && /opt/cpanel/ea-nodejs22/bin/node --import tsx scripts/health-check.ts >> logs/health.log 2>&1
```

### Cron Schedule Summary

```
Minute:   :05     :08      :11    :14          :35     :38      :41    :44
Job:      fetch   cluster  rank   AI analysis  fetch   cluster  rank   AI analysis

Every 6 hours @ :00 (00:00, 06:00, 12:00, 18:00): health check
Daily @ 2:00 AM: backup
Daily @ 3:00 AM: cleanup
```

### Why Phased Offsets

- Fetch runs first and produces new articles
- Cluster runs 3 minutes later to process those articles
- Rank runs 6 minutes later to score the new clusters
- AI runs 9 minutes later to analyze the top-ranked new clusters
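- Each phase consumes the output of the previous one, but cron does not enforce that order
- The offsets are a buffer, not a guarantee: if a phase runs longer than its 3-minute head start, the next phase works with what is available and picks up the remainder on the next 30-minute cycle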

---

## 14. Deduplication Strategy

### Three-Layer Approach

#### Layer 1: Exact Duplicate (Pre-save)

Before inserting any article, compute and check:

| Hash | Algorithm | Input | Purpose |
|------|-----------|-------|---------|
| `url_hash` | SHA-256 | Normalized URL (lowercase, strip protocol, strip www, strip trailing slash, strip query params except essential ones, strip UTM params) | Same URL = same article |
| `title_hash` | SHA-256 | Normalized title (lowercase, strip extra whitespace, remove common prefixes like "BREAKING:", "UPDATE:") | Same title from same/related source |
| `content_hash` | SHA-256 | Normalized (title + first 500 chars of content), lowercased, whitespace-normalized | Same content body |

**Logic:**
```sql
-- Before insert, check all three hashes
SELECT id FROM raw_articles
WHERE url_hash = $newUrlHash
   OR title_hash = $newTitleHash
   OR content_hash = $newContentHash
LIMIT 1
```

If a match is found:
- Save the new article with `status='duplicate'`, `duplicateOfId=<matched_id>`, and `duplicateReason` set to the hash that matched (`url`, `title`, or `content`)
- Increment `duplicatesFound` counter on fetch_log
- Do NOT add to any story cluster
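The normalization and hashing rules from the table above can be sketched with Node's built-in `crypto` module. This is one possible reading of those rules, not a definitive implementation: `ESSENTIAL_PARAMS` is a hypothetical allow-list (the spec does not name the "essential" query parameters), and `normalizeUrl`, `normalizeTitle`, and `articleHashes` are illustrative names.

```typescript
import { createHash } from "node:crypto"

// Assumption: query parameters that identify the article; everything else (UTM etc.) is dropped.
const ESSENTIAL_PARAMS = new Set(["id", "p", "article"])

function sha256(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex")
}

// Lowercase; strip protocol, "www.", non-essential query params, fragment, trailing slash.
function normalizeUrl(rawUrl: string): string {
  const url = new URL(rawUrl)
  const kept = Array.from(url.searchParams.entries())
    .filter(([key]) => ESSENTIAL_PARAMS.has(key.toLowerCase()))
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([key, value]) => `${key}=${value}`)
  const host = url.hostname.replace(/^www\./, "")
  const path = url.pathname.replace(/\/+$/, "")
  const query = kept.length > 0 ? `?${kept.join("&")}` : ""
  return `${host}${path}${query}`.toLowerCase()
}

// Lowercase, drop a leading "BREAKING:" or "UPDATE:" prefix, collapse whitespace.
function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/^\s*(breaking|update)\s*:\s*/, "")
    .replace(/\s+/g, " ")
    .trim()
}

function articleHashes(url: string, title: string, content: string) {
  const normalizedTitle = normalizeTitle(title)
  // First 500 characters of the lowercased, whitespace-normalized content.
  const body = content.toLowerCase().replace(/\s+/g, " ").trim().slice(0, 500)
  return {
    urlHash: sha256(normalizeUrl(url)),
    titleHash: sha256(normalizedTitle),
    contentHash: sha256(`${normalizedTitle} ${body}`),
  }
}
```

Two fetches of the same article through different tracking links should then produce identical `urlHash` values, which is what the Layer 1 lookup relies on.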

#### Layer 2: Near-Duplicate (Post-save, pre-clustering)

After saving new articles, check for near-duplicates using **Levenshtein distance** or **Jaro-Winkler similarity** on titles:

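```typescript
// For each new article title, compare against recent (last 24h) article titles
// from different sources in the same category.
const NEAR_DUPLICATE_THRESHOLD = 0.85 // 85% title similarity = near duplicate

// `similarity` is any string metric that returns a value in [0, 1],
// such as Jaro-Winkler or a normalized Levenshtein distance.
function isNearDuplicate(
  title1: string,
  title2: string,
  similarity: (a: string, b: string) => number,
): boolean {
  // On a match, the caller marks the lower-trust-source article as duplicate
  // or merges both into the same cluster at clustering time.
  return similarity(title1, title2) > NEAR_DUPLICATE_THRESHOLD
}
```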

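This catches titles that differ only by small character-level edits, for example:
- "Fed raises interest rates by 0.25%" vs "Fed raises interest rate by 0.25 %"

Paraphrases with little character overlap, such as "RBI keeps repo rate unchanged" vs "Reserve Bank leaves policy rate unchanged" or "Fed raises interest rates by 0.25%" vs "Federal Reserve hikes rates 25 basis points", typically score below the threshold on a character-based metric. They are left to Layer 3.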
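For reference, Jaro-Winkler similarity is small enough to hand-roll. The sketch below is a standard textbook formulation with the usual scaling factor of 0.1 and a prefix cap of 4 characters. It is not taken from the project's code, and a library implementation could be used instead.

```typescript
// Jaro similarity: counts characters that match within a sliding range, then
// discounts matched characters that appear in a different order.
function jaro(a: string, b: string): number {
  if (a === b) return 1
  if (a.length === 0 || b.length === 0) return 0
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1)
  const aMatched = new Array<boolean>(a.length).fill(false)
  const bMatched = new Array<boolean>(b.length).fill(false)
  let matches = 0
  for (let i = 0; i < a.length; i++) {
    const start = Math.max(0, i - range)
    const end = Math.min(b.length - 1, i + range)
    for (let j = start; j <= end; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true
        bMatched[j] = true
        matches++
        break
      }
    }
  }
  if (matches === 0) return 0
  // Walk both strings in order and count matched characters that disagree.
  let outOfOrder = 0
  let k = 0
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue
    while (!bMatched[k]) k++
    if (a[i] !== b[k]) outOfOrder++
    k++
  }
  const transpositions = outOfOrder / 2
  return (matches / a.length + matches / b.length + (matches - transpositions) / matches) / 3
}

// Winkler adjustment: boost pairs that share a common prefix of up to 4 characters.
function jaroWinklerSimilarity(a: string, b: string, scaling = 0.1): number {
  const base = jaro(a, b)
  let prefix = 0
  while (prefix < 4 && prefix < a.length && prefix < b.length && a[prefix] === b[prefix]) {
    prefix++
  }
  return base + prefix * scaling * (1 - base)
}
```

The comparison is case-sensitive, so titles should be normalized (lowercased, whitespace-collapsed) before being passed in.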

#### Layer 3: Same Story (During Clustering)

Handled by the clustering engine (see Section 15). Articles about the same event but with different angles are grouped together, not deduplicated.

### Deduplication Flow

```
New Article Fetched
        │
        ▼
┌───────────────────┐
│ Layer 1: Exact    │──match──▶ Mark duplicate, skip
│ URL/Title/Content │
└───────┬───────────┘
        │ no match
        ▼
┌───────────────────┐
│ Save as new       │
│ article           │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│ Layer 2: Near-    │──match──▶ Mark duplicate, note similarity
│ Duplicate Title   │
└───────┬───────────┘
        │ no match
        ▼
┌───────────────────┐
│ Layer 3: Story    │──match──▶ Group in same cluster
│ Clustering        │
└───────┬───────────┘
        │ no match
        ▼
   New unique story
   (own cluster)
```

---

## 15. Story Clustering Strategy

### Approach: TF-IDF + Cosine Similarity

For MVP, use a lightweight in-process approach without external vector databases.

#### Step 1: Preprocessing

For each new article with `status='new'` that survived deduplication:
1. Extract title and first 200 characters of content
2. Lowercase, remove stop words, strip punctuation
3. Optionally, translate non-English titles to English (future enhancement)
4. Group articles by (countryId, categoryId) — cluster within same scope

#### Step 2: TF-IDF Vectorization

```
For each batch of articles in the same (country, category):
  1. Build vocabulary (all unique terms across all articles in batch)
  2. Compute TF-IDF vector for each article title+content
  3. Output: sparse numeric vectors of equal length
```

Implementation note: Use a lightweight JS library like `natural` or a hand-rolled TF-IDF (the math is straightforward for MVP volumes).
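A hand-rolled version might look like the sketch below. It is an illustration under stated assumptions: `STOP_WORDS` is a deliberately tiny placeholder list, the IDF uses a common smoothed form (`ln((1 + N) / (1 + df)) + 1`), and vectors are stored as sparse `Map`s keyed by term rather than as dense arrays.

```typescript
// Assumption: a tiny stop-word list for illustration; a real list would be longer.
const STOP_WORDS = new Set(["a", "an", "the", "of", "to", "in", "on", "for", "and", "by"])

// Lowercase, strip punctuation, split on whitespace, drop stop words.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .split(/\s+/)
    .filter((term) => term.length > 0 && !STOP_WORDS.has(term))
}

// Returns one sparse vector (term -> TF-IDF weight) per document in the batch.
function tfidfVectors(documents: string[]): Array<Map<string, number>> {
  const tokenized = documents.map(tokenize)
  // Document frequency: in how many documents each term appears.
  const documentFrequency = new Map<string, number>()
  for (const terms of tokenized) {
    new Set(terms).forEach((term) => {
      documentFrequency.set(term, (documentFrequency.get(term) ?? 0) + 1)
    })
  }
  return tokenized.map((terms) => {
    const counts = new Map<string, number>()
    for (const term of terms) counts.set(term, (counts.get(term) ?? 0) + 1)
    const vector = new Map<string, number>()
    counts.forEach((count, term) => {
      const df = documentFrequency.get(term) ?? 1
      // Smoothed IDF keeps weights positive even for terms present in every document.
      const idf = Math.log((1 + documents.length) / (1 + df)) + 1
      vector.set(term, (count / terms.length) * idf)
    })
    return vector
  })
}
```

Terms that are rare within the batch receive a higher weight than terms shared by many articles, which is what lets distinctive words dominate the similarity step.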

#### Step 3: Cosine Similarity Matrix

```
For each pair of articles (i, j) in the batch:
  similarity = cosine(tfidf_i, tfidf_j)
  
  If similarity >= 0.65 (configurable threshold):
    → Group together (same story)
  Else:
    → Different stories
```
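The pairwise comparison can be sketched as follows, assuming each article is represented by a sparse term-to-weight map. `SparseVector`, `cosineSimilarity`, and `similarPairs` are illustrative names. The 0.65 default mirrors the threshold above.

```typescript
type SparseVector = Map<string, number>

// Cosine similarity = dot(a, b) / (|a| * |b|); returns 0 for empty vectors.
function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0
  let normA = 0
  let normB = 0
  a.forEach((weight, term) => {
    normA += weight * weight
    dot += weight * (b.get(term) ?? 0)
  })
  b.forEach((weight) => {
    normB += weight * weight
  })
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Returns index pairs [i, j] with i < j whose similarity meets the threshold.
function similarPairs(vectors: SparseVector[], threshold = 0.65): Array<[number, number]> {
  const pairs: Array<[number, number]> = []
  vectors.forEach((a, i) => {
    vectors.slice(i + 1).forEach((b, offset) => {
      if (cosineSimilarity(a, b) >= threshold) pairs.push([i, i + 1 + offset])
    })
  })
  return pairs
}
```

The loop is quadratic in batch size, which is why batches are scoped to one (country, category) pair.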

#### Step 4: Cluster Assignment

```
For each connected component in the similarity graph:
  - If component has >= 2 articles:
    → Create or merge into a StoryCluster
  - If component has 1 article (singleton):
    → Create a single-article StoryCluster
  
  - Select canonical_title (first rule that yields a single title wins):
    - Prefer title from highest-trust source
    - Or shortest clear title
    - Or title shared by the most sources in the cluster
  
  - Create StoryArticle rows linking each article to its cluster
  - Update raw_article status to 'clustered'
```
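Finding connected components in the similarity graph can be done with a small union-find structure. The sketch below assumes articles are identified by their index in the batch and that the similar pairs have already been computed. `connectedComponents` is an illustrative name.

```typescript
// Groups item indices 0..count-1 into connected components, given similar pairs.
function connectedComponents(count: number, pairs: Array<[number, number]>): number[][] {
  const parent: number[] = Array.from({ length: count }, (_, index) => index)

  // Follows parent links to the root, shortening the path along the way.
  const find = (start: number): number => {
    let x = start
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]!]!
      x = parent[x]!
    }
    return x
  }

  // Union: attach the root of i to the root of j.
  for (const [i, j] of pairs) parent[find(i)] = find(j)

  const groups = new Map<number, number[]>()
  for (let index = 0; index < count; index++) {
    const root = find(index)
    const members = groups.get(root) ?? []
    members.push(index)
    groups.set(root, members)
  }
  return Array.from(groups.values())
}
```

Components are transitive: if A is similar to B and B to C, all three land in one cluster even when A and C are not directly similar. That chaining effect is one reason a cap on cluster size is useful.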

#### Step 5: Merge with Existing Clusters

New clusters may match existing clusters. After creating new clusters:

```
For each new cluster:
  Compare canonical_title against existing clusters (last 7 days, same category/country)
  If cosine similarity >= 0.60 with an existing cluster:
    → Merge: add articles to existing cluster, update lastSeenAt, recalculate sourceCount
  Else:
    → Keep as separate new cluster
```

### Clustering Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| Similarity threshold (new clusters) | 0.65 | Minimum cosine similarity to group articles |
| Similarity threshold (merge existing) | 0.60 | Minimum similarity to merge with existing cluster |
| Lookback window | 7 days | How far back to consider existing clusters for merging |
| Max articles per cluster | 100 | Cap to prevent runaway clusters |
| Min title length | 15 chars | Ignore articles with very short titles |

### Performance Considerations

- Cluster in batches per (country, category) — never the entire articles table at once
- Limit each batch to articles from the last 24 hours
- For very large batches (>1000 articles), use random sampling for pairwise comparison
- Store similarity computations in memory; SQLite is only for persistence

---

## 16. Ranking Strategy

### Ranking Formula

```
Rank Score = Σ (factor_weights × factor_values) for each active factor
```

All factor weights are configurable via Admin → Ranking Rules and stored as JSON in the `settings` table.

### Default Ranking Factors

#### A. Freshness Score

| Condition | Points |
|-----------|--------|
| Story first seen within 1 hour | +20 |
| Story first seen within 6 hours | +15 |
| Story first seen within 24 hours | +10 |
| Story first seen within 3 days | +5 |
| Story first seen within 7 days | +2 |
| Older than 7 days | 0 |
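
The freshness table maps directly onto a small lookup function. This sketch assumes the tier boundaries are inclusive (a story exactly 6 hours old still earns 15 points); the function name and signature are illustrative.

```typescript
// Returns freshness points from the age of the story's first sighting.
// Inclusive tier boundaries are an assumption, not stated in the table.
function freshnessScore(firstSeenAt: Date, now: Date = new Date()): number {
  const ageHours = (now.getTime() - firstSeenAt.getTime()) / 3600000;
  if (ageHours <= 1) return 20;
  if (ageHours <= 6) return 15;
  if (ageHours <= 24) return 10;
  if (ageHours <= 72) return 5; // 3 days
  if (ageHours <= 168) return 2; // 7 days
  return 0;
}
```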

#### B. Source Trust Score

Sum of trust scores of all sources covering the story, normalized:

```
trustSourceScore = (Σ(source.trustScore) / maxPossibleTrust) × 25
```

Where `maxPossibleTrust = sourceCount × 10`.
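
As a hedged sketch, the trust component can be computed as below. It assumes the normalized ratio (always between 0 and 1, since trust scores top out at 10) is scaled to a 25-point maximum, which is consistent with the `sourceTrust: 18` value in the rank breakdown example; `sourceTrustPoints` is an illustrative name.

```typescript
// Average trust of the covering sources, scaled to maxPoints.
// Scaling the 0..1 ratio up to 25 points is an assumption.
function sourceTrustPoints(trustScores: number[], maxPoints = 25): number {
  if (trustScores.length === 0) return 0;
  const total = trustScores.reduce((sum, score) => sum + score, 0);
  const maxPossibleTrust = trustScores.length * 10;
  return Math.min((total / maxPossibleTrust) * maxPoints, maxPoints);
}
```

For example, two sources with trust scores of 5 each give a ratio of 0.5 and therefore 12.5 points.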

#### C. Source Count Score

| Condition | Points |
|-----------|--------|
| 10+ unique sources | +20 |
| 5–9 unique sources | +10 |
| 3–4 unique sources | +5 |
| 1–2 unique sources | +2 |

#### D. Official Source Boost

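If at least one source has `trustScore >= 9` (official government/regulatory sources; under the default trust scores in Section 19 this threshold also includes major wire services):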

| Condition | Points |
|-----------|--------|
| Official source present | +25 |

#### E. Impact Score (from AI analysis)

| Impact Level | Points |
|--------------|--------|
| Critical | +25 |
| High | +20 |
| Medium | +10 |
| Low | +5 |

#### F. Category Priority Boost

Each category can have a priority multiplier (1.0 = no boost, 2.0 = double):

| Category | Default Priority Multiplier |
|----------|----------------------------|
| Breaking News | 1.5 |
| Finance & Stock Market | 1.3 |
| Health & Wellness | 1.2 |
| Business & Economy | 1.1 |
| Technology | 1.0 |
| Science & Space | 1.0 |
| Education & Careers | 1.0 |

#### G. Penalties (subtractions)

| Condition | Points |
|-----------|--------|
| Low confidence (AI confidence = low) | -20 |
| Medium confidence | -10 |
| Clickbait detected (title has excessive caps, exclamation marks, sensational words) | -15 |
| Duplicate/repetitive within same category | -10 |
| Source(s) have low trust (avg trustScore < 3) | -20 |
| Stale story (> 48 hours, no updates) | -10 |

### Category-Specific Scoring Adjustments

#### Breaking News
- Freshness weight: 2.0× (double)
- Source trust weight: 1.5×
- Impact weight: 1.2×

#### Finance & Stock Market
- Official source bonus: 1.5×
- Source trust weight: 2.0×
- Freshness: standard

#### Health & Wellness
- Official source bonus: 2.0× (must have credible health source)
- Impact weight: 1.5×
- Clickbait penalty: 2.0× (severe penalty for sensational health claims)

#### Education & Careers
- Official source bonus: 1.5×
- Freshness: 0.8× (education news has longer shelf life)

#### Science & Space
- Official source bonus: 2.0× (must have credible scientific source)
- Source trust weight: 2.0×

#### Technology
- Source trust weight: 1.2×
- Impact weight: 1.0×
- Clickbait penalty: 1.5×

### Top-N Selection Per Category

After ranking, select the top N clusters per (country, category):

```
N = category.maxPublicStories (default 50)

For each (country, category) combination:
  SELECT clusters WHERE status = 'published' (or 'approved' if auto-publish)
  ORDER BY rank_score DESC
  LIMIT N
```
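
The same selection can be expressed in application code, for instance when clusters are already loaded in memory. This is a sketch only: the `RankedCluster` shape and the `selectTopN` helper are assumptions, and it filters on `status = 'published'` without modelling the auto-publish variant.

```typescript
interface RankedCluster {
  id: number;
  countryCode: string;
  categorySlug: string;
  status: string;
  rankScore: number;
}

// Returns the top clusters per "country:category" key, highest rank first.
function selectTopN(
  clusters: RankedCluster[],
  maxPublicStories: (categorySlug: string) => number = () => 50,
): Map<string, RankedCluster[]> {
  const groups = new Map<string, RankedCluster[]>();
  for (const cluster of clusters) {
    if (cluster.status !== "published") continue;
    const key = `${cluster.countryCode}:${cluster.categorySlug}`;
    const group = groups.get(key) ?? [];
    group.push(cluster);
    groups.set(key, group);
  }
  groups.forEach((group, key) => {
    const limit = maxPublicStories(group[0].categorySlug);
    groups.set(key, [...group].sort((a, b) => b.rankScore - a.rankScore).slice(0, limit));
  });
  return groups;
}
```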

### Rank Breakdown Storage

Each cluster stores its rank breakdown as JSON:

```json
{
  "freshness": 20,
  "sourceTrust": 18,
  "sourceCount": 10,
  "officialSource": 25,
  "impact": 20,
  "categoryPriority": 1.0,
  "clickbaitPenalty": 0,
  "confidencePenalty": 0,
  "duplicatePenalty": 0,
  "total": 93
}
```

This allows the admin UI to show exactly why a story is ranked where it is.
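
A sketch of how the `total` field might be derived from the other fields follows. It assumes penalties are stored as zero or negative numbers and that the category priority multiplier is applied to the sum of all points; the spec does not state where the multiplier is applied, so treat that ordering as an assumption. With the example values above it reproduces `total: 93`.

```typescript
interface RankBreakdown {
  freshness: number;
  sourceTrust: number;
  sourceCount: number;
  officialSource: number;
  impact: number;
  categoryPriority: number; // multiplier, 1.0 = no boost
  clickbaitPenalty: number; // zero or negative (assumption)
  confidencePenalty: number;
  duplicatePenalty: number;
  total: number;
}

// Fills in `total` from the individual factors.
function withTotal(parts: Omit<RankBreakdown, "total">): RankBreakdown {
  const points =
    parts.freshness + parts.sourceTrust + parts.sourceCount + parts.officialSource + parts.impact;
  const penalties = parts.clickbaitPenalty + parts.confidencePenalty + parts.duplicatePenalty;
  const total = Math.round((points + penalties) * parts.categoryPriority);
  return { ...parts, total };
}
```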

---

## 17. AI Analysis Strategy

### AI Provider Abstraction

The system supports multiple AI providers through a common interface:

```typescript
interface AIProvider {
  name: string;
  analyze(params: AnalyzeParams): Promise<AIResult>;
}

interface AnalyzeParams {
  title: string;
  articles: { title: string; snippet: string; sourceName: string; trustScore: number }[];
  category: { name: string; aiSafetyLevel: string };
  sourceCount: number;
  trustedSourceCount: number;
  officialSourcePresent: boolean;
}
```

Supported providers (configurable in settings):
- **Anthropic** (Claude) — recommended for nuanced analysis
- **OpenAI** (GPT-4o) — strong general purpose
- **Google** (Gemini) — cost-effective option

The active provider is set in Admin → Settings. The provider's API key is stored in `.env`.

### AI Prompt Template

```
You are an AI news analyst for HeadlineSift.com. Your job is to analyze a news story cluster and produce a structured analysis.

## Input
The following story is covered by {sourceCount} sources ({trustedSourceCount} trusted, official source: {officialSourcePresent}).

Category: {categoryName}
Articles in this cluster:
{for each article}
- [{sourceName}] (trust: {trustScore}): {title}
  Snippet: {snippet}
{end for}

## Safety Rules for {categoryName}
{categorySafetyRules}

## Instructions
Based ONLY on the provided source material, generate the following. Do NOT invent facts. If information is insufficient, state that clearly.

1. **summary**: A 2-3 sentence neutral summary of what happened.
2. **whyItMatters**: 1-2 sentences explaining why this is important to the reader.
3. **positiveImpact**: What positive outcomes could result from this? Be specific. If none are clear, say "No clear positive impact identified."
4. **negativeImpact**: What negative outcomes or risks could result? Be specific. If none are clear, say "No clear negative impact identified."
5. **affectedGroups**: JSON array of groups affected (e.g., ["borrowers", "investors", "homeowners"]). Max 5 groups.
6. **impactLevel**: One of: low, medium, high, critical. Consider: how many people affected, severity, urgency, long-term significance.
7. **confidenceLevel**: One of: low, medium, high. Consider: number of sources, trust level of sources, consistency across sources, presence of official sources.
8. **confidenceReason**: One sentence explaining the confidence level.
9. **neutralityCheck**: Brief assessment of whether the source material is neutral or biased. Flag any emotionally charged language.
10. **riskWarning**: Any caution the reader should have. For finance: "Not financial advice." For health: "Not medical advice." For breaking: "Developing story, details may change." Otherwise, null.
11. **displayHeadline**: A clear, neutral headline (max 100 chars) suitable for display. Prefer factual over sensational.

## Output Format
Return ONLY valid JSON matching this schema. No markdown, no additional text.
```
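
Filling the template can be done with plain string substitution. The sketch below is one possible approach, not the spec's implementation: `renderArticles` builds the per-article section and `fillTemplate` replaces simple `{name}` placeholders. Both helper names are hypothetical.

```typescript
interface PromptArticle {
  title: string;
  snippet: string;
  sourceName: string;
  trustScore: number;
}

// Renders the "{for each article} ... {end for}" section of the template.
function renderArticles(articles: PromptArticle[]): string {
  return articles
    .map((a) => `- [${a.sourceName}] (trust: ${a.trustScore}): ${a.title}\n  Snippet: ${a.snippet}`)
    .join("\n");
}

// Replaces {name} placeholders; unknown placeholders are left untouched.
function fillTemplate(template: string, values: Record<string, string | number | boolean>): string {
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match,
  );
}
```

Leaving unknown placeholders intact makes a missing value easy to spot in logs before the prompt is sent.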

### Category-Specific Safety Rules Injected into Prompt

#### Health & Wellness
```
- Do NOT give medical advice.
- Do NOT claim something cures, treats, or prevents a disease unless stated by an official health source (WHO, CDC, NIH, etc.).
- Use cautious wording: "may help," "studies suggest," "evidence is preliminary."
- If the story involves a treatment or drug trial, clearly state the phase and limitations.
- Flag any health claims not backed by an official source in the riskWarning.
```

#### Finance & Stock Market
```
- Do NOT give investment advice.
- Do NOT predict market movements or stock prices.
- Use "may affect" instead of "will affect."
- Include "Not financial advice." in the riskWarning for any story that could be seen as actionable.
- Prefer official sources (central banks, regulators, exchanges) over analyst opinions.
```

#### Education & Careers
```
- Prefer official education sources (government education departments, accredited institutions).
- For exam/deadline stories, clearly state dates and verify from official sources.
- Flag stories with uncertain or vague deadlines.
```

#### Breaking News
```
- Mark as "Developing story, details may change." in riskWarning if the situation is still unfolding.
- Do not overstate casualty numbers, damage, or political claims.
- Use cautious language for unconfirmed reports: "reportedly," "according to initial reports."
```

#### Science & Space
```
- Distinguish between peer-reviewed findings and preprints.
- Use cautious language for preliminary research: "early findings suggest," "has not yet been peer-reviewed."
- Flag if a discovery claim comes from a single source without independent verification.
```

#### Technology
```
- Distinguish between announced products and rumors/leaks.
- Flag if a story is based on a single unverified source.
- For cybersecurity stories, include severity context but do not cause unnecessary alarm.
```

### AI Result Validation

The AI response is validated with Zod before saving:

```typescript
const aiAnalysisSchema = z.object({
  summary: z.string().min(20).max(500),
  whyItMatters: z.string().min(10).max(300),
  positiveImpact: z.string().min(5).max(500),
  negativeImpact: z.string().min(5).max(500),
  affectedGroups: z.array(z.string()).max(5),
  impactLevel: z.enum(['low', 'medium', 'high', 'critical']),
  confidenceLevel: z.enum(['low', 'medium', 'high']),
  confidenceReason: z.string().min(10).max(300),
  neutralityCheck: z.string().min(5).max(300),
  riskWarning: z.string().nullable(),
  displayHeadline: z.string().min(5).max(100),
});

// If validation fails, log the raw response and set confidence to 'low'
// Admin can manually edit in review queue
```

### AI Cost Control

| Control | Value |
|---------|-------|
| Max clusters analyzed per run | 20 (configurable) |
| Max input tokens per analysis | 4000 |
| Max output tokens | 1000 |
| Temperature | 0.3 (low for factual consistency) |
| Retry on validation failure | 1 retry |
| Cache | Do not regenerate if cluster hasn't changed |
| Daily AI call limit | Configurable in settings (default 500) |

### AI Analysis Cache Logic

```
Before generating AI analysis:
  - If the cluster has no ai_story_analysis → generate
  - If humanEdited = true → skip (never overwrite admin edits)
  - Otherwise, check if the cluster has changed since the last analysis:
    - New articles added since last analysis?
    - sourceCount changed by > 20%?
  - If unchanged, skip
  - If changed, regenerate only if the change is significant (for example, the > 20% sourceCount change)
```
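
The cache decision can be sketched as a pure function. This sketch assumes a "significant" change means the source count moved by more than 20% since the last analysis; the `ClusterState` and `AnalysisState` shapes, including `sourceCountAtAnalysis`, are hypothetical fields introduced for illustration.

```typescript
interface ClusterState {
  sourceCount: number;
  lastArticleAddedAt: Date;
}

interface AnalysisState {
  humanEdited: boolean;
  sourceCountAtAnalysis: number; // hypothetical field
  analyzedAt: Date;
}

// Returns true when a new AI analysis should be generated.
function shouldRegenerate(cluster: ClusterState, analysis: AnalysisState | null): boolean {
  if (analysis === null) return true; // never analyzed
  if (analysis.humanEdited) return false; // never overwrite admin edits

  const hasNewArticles = cluster.lastArticleAddedAt.getTime() > analysis.analyzedAt.getTime();
  const changed = hasNewArticles || cluster.sourceCount !== analysis.sourceCountAtAnalysis;
  if (!changed) return false;

  // Assumption: only a > 20% swing in source count counts as significant.
  const baseline = Math.max(analysis.sourceCountAtAnalysis, 1);
  const relativeChange = Math.abs(cluster.sourceCount - analysis.sourceCountAtAnalysis) / baseline;
  return relativeChange > 0.2;
}
```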

---

## 18. Admin Moderation Workflow

### Review Pipeline

```
Story Cluster Created (status: pending_review)
        │
        ▼
┌───────────────┐     ┌──────────────────────┐
│ Auto-publish? │─Yes─▶ status = published    │
│ (category     │      │ publishedAt = NOW()  │
│  autoPublish  │      └──────────────────────┘
│  = true)      │
└───────┬───────┘
        │ No
        ▼
┌───────────────────┐
│ Appears in Review  │
│ Queue for admin    │
└───────┬───────────┘
        │
        ▼
   Admin reviews
        │
   ┌────┼────────────┐
   ▼    ▼            ▼
Approve  Edit      Reject
   │    │            │
   │    ▼            ▼
   │  Admin edits  status = rejected
   │  AI fields    (kept in DB,
   │  (humanEdited  hidden from public)
   │   = true)
   │    │
   ▼    ▼
Publish
(status = published,
 publishedAt = NOW())
```

### Review Queue UI

The Review Queue page shows:
- Filter tabs: All Pending | Health | Finance | Breaking News | Education | Other
- Each cluster is a card showing:
  - Canonical title
  - Category + Country badges
  - Rank score with breakdown
  - Source count + official source indicator
  - AI analysis (collapsed by default, expandable)
  - Quick actions: ✓ Approve & Publish | ✗ Reject | ✎ Edit | 👁 Hide

### Manual Review Triggers

The following conditions force manual review (even if auto-publish is on for the category):

| Trigger | Reason |
|---------|--------|
| AI confidence = low | Needs human judgment |
| impactLevel = critical | High-stakes story needs verification |
| neutralityCheck flags bias | Potentially biased coverage |
| riskWarning is not null | Has risk warning that needs human review |
| Single source only | Only one source reporting — may be unverified |
| Category aiSafetyLevel = critical | Category requires mandatory human review |
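
The trigger table can be evaluated with a function that returns every matching reason, so the Review Queue can show why a story was held back. This is a sketch; the `ReviewInput` shape is an assumption, and `neutralityFlagged` in particular presumes that the free-text `neutralityCheck` has already been reduced to a boolean.

```typescript
interface ReviewInput {
  confidenceLevel: "low" | "medium" | "high";
  impactLevel: "low" | "medium" | "high" | "critical";
  neutralityFlagged: boolean; // assumption: derived from the neutralityCheck text
  riskWarning: string | null;
  sourceCount: number;
  categorySafetyLevel: string;
}

// Returns the reasons that force manual review; an empty array means none apply.
function manualReviewReasons(input: ReviewInput): string[] {
  const reasons: string[] = [];
  if (input.confidenceLevel === "low") reasons.push("AI confidence is low");
  if (input.impactLevel === "critical") reasons.push("Impact level is critical");
  if (input.neutralityFlagged) reasons.push("Neutrality check flagged bias");
  if (input.riskWarning !== null) reasons.push("Risk warning present");
  if (input.sourceCount <= 1) reasons.push("Single source only");
  if (input.categorySafetyLevel === "critical") reasons.push("Category requires mandatory review");
  return reasons;
}
```

A cluster is eligible for auto-publish only when the returned array is empty.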

### Batch Operations

Admin can select multiple clusters and:
- **Approve & Publish All**: Publishes all selected (confidence >= medium only; low-confidence stories require individual approval)
- **Reject All**: Rejects all selected with optional note

### Edit Workflow

Admin can edit any AI-generated field:
- Opens inline editor or modal
- Edits are saved immediately
- `humanEdited` flag is set to `true`
- Future AI re-analysis will NOT overwrite human-edited fields
- Admin can click "Reset to AI" to discard edits and restore AI output

### Rejection Reasons

When rejecting, admin selects a reason:
- Low quality / inaccurate
- Duplicate of another story
- Not newsworthy
- Inappropriate content
- Source material insufficient
- Other (with note)

### Publishing Rules

- Only clusters with `status='published'` appear on the public page
- A cluster needs ALL of: AI analysis generated, admin approval (or auto-publish), status='published'
- Once published, the cluster appears in top-N selection for its category/country
- If a cluster is later hidden, it's immediately removed from public view

---

## 19. Source Management Workflow

### Source Lifecycle

```
Source Added (status: inactive)
        │
        ▼
┌─────────────────────┐
│ Admin verifies      │
│ feed/API endpoint   │
│ (Test Feed button)  │
└───────┬─────────────┘
        │
   ┌────┼────────────┐
   ▼    ▼            ▼
Active  Inactive   Blocked
(P0)   (kept,     (known bad,
       not used)   never fetch)
   │
   ▼
Source is fetched on schedule
   │
   │ (consecutive failures >= 5)
   ▼
Source marked as Paused (auto)
Admin notified to investigate
   │
   │ (admin fixes and reactivates)
   ▼
Active again
```

### Adding a Source

1. Admin navigates to Sources → Add Source
2. Fills in:
   - Name (required)
   - Website URL (required)
   - Source type: RSS / API / Manual
   - Feed or API endpoint URL (if known)
   - API key reference (if API type)
   - Language (default: English)
   - Trust score (default based on type, admin can override)
   - Fetch frequency (default based on category mapping)
   - Status (default: inactive)
   - Usage notes
3. Saves → source created with status='inactive'
4. Admin clicks "Test Feed" to verify the endpoint works
5. If test succeeds → admin sets status='active'
6. Admin creates source mappings (country + category assignments)

### Source Trust Score Defaults

| Source Type | Default Trust Score |
|-------------|-------------------|
| Official government/agency | 10 |
| Central bank / regulator / stock exchange | 10 |
| Major wire service (AP, Reuters, AFP) | 9 |
| Major newspaper/channel (BBC, NYT, etc.) | 8 |
| Specialist publication (TechCrunch, Nature, etc.) | 7 |
| Small niche website | 5 |
| Unknown blog | 3 |

### Testing a Source Feed

Admin clicks "Test Feed" on a source:
1. Fetcher attempts to connect to the RSS/API endpoint
2. Parses and displays up to 5 sample articles
3. Shows response time, HTTP status, article count
4. Does NOT save articles to database
5. Admin sees results and decides to activate or edit

### Importing Sources via CSV

See Section 21.

---

## 20. Source Health Monitoring

### Health Metrics Per Source

Each source has computed health metrics:

| Metric | Formula | Healthy Range |
|--------|---------|---------------|
| `reliabilityScore` | 1.0 - (errorRate + duplicateRate) | > 0.7 |
| `consecutiveFailures` | Counter, reset on success | < 3 |
| `errorRate` | errorsFound / totalArticlesAttempted (last 30 days) | < 0.1 |
| `duplicateRate` | duplicatesFound / articlesFound (last 30 days) | < 0.3 |
| `responseTimeMs` | Average from last 10 fetch logs | < 10000 |

### Auto-Pause Logic

The health check script (`health-check.ts`) runs every 6 hours and:

```
For each active source:
  IF consecutiveFailures >= 5:
    → Set status = 'paused'
    → Log warning to logs/health.log
  IF errorRate > 0.5 AND totalFetches >= 10:
    → Set status = 'paused'
    → Log warning
  IF duplicateRate > 0.8 AND totalFetches >= 10:
    → Set reliabilityScore -= 0.1
    → Log warning (feed may have changed)
```
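
The auto-pause rules above can be sketched as a pure decision function, which keeps the thresholds testable apart from database access. The `SourceHealthInput` and `HealthDecision` types are illustrative assumptions, not part of the schema.

```typescript
interface SourceHealthInput {
  consecutiveFailures: number;
  errorRate: number;
  duplicateRate: number;
  totalFetches: number;
}

interface HealthDecision {
  pause: boolean;
  reliabilityDelta: number;
  warnings: string[];
}

// Applies the three auto-pause rules and reports what should change.
function evaluateSourceHealth(source: SourceHealthInput): HealthDecision {
  const decision: HealthDecision = { pause: false, reliabilityDelta: 0, warnings: [] };
  if (source.consecutiveFailures >= 5) {
    decision.pause = true;
    decision.warnings.push("5 or more consecutive failures");
  }
  if (source.errorRate > 0.5 && source.totalFetches >= 10) {
    decision.pause = true;
    decision.warnings.push("error rate above 0.5");
  }
  if (source.duplicateRate > 0.8 && source.totalFetches >= 10) {
    decision.reliabilityDelta = -0.1;
    decision.warnings.push("duplicate rate above 0.8 (feed may have changed)");
  }
  return decision;
}
```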

### Health Dashboard

Admin → Sources page shows a health column with color indicators:
- 🟢 Healthy: consecutiveFailures = 0, errorRate < 0.1
- 🟡 Warning: 1-2 consecutive failures OR errorRate 0.1-0.3
- 🔴 Critical: 3+ consecutive failures OR errorRate > 0.3
- ⚫ Paused: auto-paused, needs admin attention
- ⚪ Inactive: never activated
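
The indicator rules can be sketched as a single function, checked from most to least severe so that a source matching several conditions gets the worst applicable colour. The `HealthIndicator` type and function name are illustrative.

```typescript
type HealthIndicator = "healthy" | "warning" | "critical" | "paused" | "inactive";

// Maps a source's status and recent metrics to a dashboard indicator.
function healthIndicator(status: string, consecutiveFailures: number, errorRate: number): HealthIndicator {
  if (status === "paused") return "paused";
  if (status === "inactive") return "inactive";
  if (consecutiveFailures >= 3 || errorRate > 0.3) return "critical";
  if (consecutiveFailures >= 1 || errorRate >= 0.1) return "warning";
  return "healthy";
}
```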

### Admin Notification

When a source is auto-paused:
- A prominent banner appears on the Admin Dashboard: "⚠ 3 sources have been paused due to failures. [View]"
- The sources appear at the top of the Sources page, filtered by "Needs Attention"

---

## 21. CSV Import/Export Plan

### CSV Export

Admin can export Sources, Categories, and Source Mappings as CSV.

**Sources CSV columns:**
```
name, websiteUrl, sourceType, feedEndpoint, apiEndpoint, apiKeyRef,
language, trustScore, status, fetchFrequencyMinutes, usageNotes
```

**Categories CSV columns:**
```
name, slug, level, description, status, displayOrder, aiSafetyLevel, maxPublicStories
```

**Source Mappings CSV columns:**
```
sourceName, countryCode, categorySlug, priority, status
```

Export is generated server-side and streamed as a downloadable CSV file.

### CSV Import (Sources Only for MVP)

**Flow:**
1. Admin uploads a CSV file
2. Server parses CSV and validates each row
3. Preview shows: total rows, valid rows, rows with errors, rows with warnings
4. Validation rules per row:
   - `name`: required, unique, max 200 chars
   - `websiteUrl`: required, valid URL format
   - `sourceType`: must be one of rss/api/manual
   - `feedEndpoint`: optional, valid URL if provided
   - `trustScore`: integer 1–10, default 5
   - `status`: if provided, must be active/inactive/paused/blocked
   - `fetchFrequencyMinutes`: integer 5–1440, default 30
5. Admin reviews preview, can deselect rows with errors
6. Admin clicks "Import Valid Rows"
7. Valid rows are inserted; error rows are reported in downloadable error CSV
8. Import creates a Job record for tracking
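
The per-row validation rules in step 4 can be sketched without any library, although in the real codebase a Zod schema would be the natural fit. Everything here is illustrative: the `CsvRow` shape, the simplified URL pattern, and the choice to compare `sourceType` and `status` case-insensitively are all assumptions.

```typescript
type CsvRow = Record<string, string | undefined>;

const URL_PATTERN = /^https?:\/\/\S+$/i; // simplified URL check (assumption)
const SOURCE_TYPES = ["rss", "api", "manual"];
const STATUSES = ["active", "inactive", "paused", "blocked"];

// Parses an optional integer; returns the fallback when empty, null when invalid.
function parseIntInRange(raw: string | undefined, min: number, max: number, fallback: number): number | null {
  const text = (raw ?? "").trim();
  if (text === "") return fallback;
  if (!/^\d+$/.test(text)) return null;
  const value = Number(text);
  return value >= min && value <= max ? value : null;
}

// Returns the list of validation errors for one CSV row (empty means valid).
function validateSourceRow(row: CsvRow, existingNames: Set<string>): string[] {
  const errors: string[] = [];
  const name = (row.name ?? "").trim();
  if (name === "") errors.push("name is required");
  else if (name.length > 200) errors.push("name exceeds 200 chars");
  else if (existingNames.has(name)) errors.push("name must be unique");

  if (!URL_PATTERN.test((row.websiteUrl ?? "").trim())) errors.push("websiteUrl must be a valid URL");
  if (SOURCE_TYPES.indexOf((row.sourceType ?? "").trim().toLowerCase()) === -1) {
    errors.push("sourceType must be rss, api, or manual");
  }
  const feed = (row.feedEndpoint ?? "").trim();
  if (feed !== "" && !URL_PATTERN.test(feed)) errors.push("feedEndpoint must be a valid URL");
  const status = (row.status ?? "").trim().toLowerCase();
  if (status !== "" && STATUSES.indexOf(status) === -1) errors.push("status is not recognized");
  if (parseIntInRange(row.trustScore, 1, 10, 5) === null) errors.push("trustScore must be an integer 1-10");
  if (parseIntInRange(row.fetchFrequencyMinutes, 5, 1440, 30) === null) {
    errors.push("fetchFrequencyMinutes must be an integer 5-1440");
  }
  return errors;
}
```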

### Bulk Mapping via CSV

After importing sources, admin can use the "Bulk Create Mappings" tool:
- Select source(s) from multi-select
- Select country/countries
- Select category/categories
- System creates all combinations as source mappings

---

## 22. SQLite Backup and Restore Strategy

### Backup Method

Use SQLite's online backup API via the `better-sqlite3` package, whose `db.backup()` method wraps it. The backup script:

```
1. Open source database (headlinesift.db)
2. Open destination database (backups/backup-YYYY-MM-DD-HHmmss.db)
3. Call sqlite3_backup_init(), sqlite3_backup_step(), sqlite3_backup_finish()
4. This creates a consistent snapshot even with WAL mode
5. Close destination
6. Log backup size and duration
```

### Backup Schedule

| Type | Frequency | Retention | Trigger |
|------|-----------|-----------|---------|
| Daily auto | Every day at 2:00 AM | Keep last 7 | cPanel cron |
| Manual | On demand | Keep until deleted | Admin UI button |

### Backup Storage

```
/home/headlinesift/headlinesift/backups/
├── backup-2026-06-09-020001.db   (9 Jun, 2:00 AM)
├── backup-2026-06-08-020002.db   (8 Jun, 2:00 AM)
├── backup-2026-06-07-020001.db   (7 Jun, 2:00 AM)
├── ... (rolling 7-day retention)
└── backup-2026-06-03-020003.db   (3 Jun, 2:00 AM) → deleted next run
```
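
The rolling retention can be sketched as a function that picks which files to delete. Because manual backups share the naming convention, this sketch assumes the caller supplies the set of manual backup file names from elsewhere (for example, a settings entry); that mechanism and the `backupsToDelete` name are assumptions.

```typescript
const AUTO_BACKUP_PATTERN = /^backup-\d{4}-\d{2}-\d{2}-\d{6}\.db$/;

// Returns automatic backups beyond the newest `keep`, skipping manual backups.
function backupsToDelete(fileNames: string[], manualBackups: Set<string> = new Set(), keep = 7): string[] {
  const automatic = fileNames
    .filter((name) => AUTO_BACKUP_PATTERN.test(name) && !manualBackups.has(name))
    .sort() // timestamped names sort chronologically
    .reverse(); // newest first
  return automatic.slice(keep);
}
```

Sorting by name works only because the timestamp format is zero-padded and ordered from year down to second.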

### Manual Backup via Admin

- Admin clicks "Create Backup Now" → backup created immediately
- Admin can download any backup file
- Admin can delete old backups
- Manual backups follow same naming convention but are exempt from auto-deletion

### Restore Procedure

Restore is a manual process (not via UI for safety):

```bash
# 1. Stop the Next.js app
pm2 stop headlinesift

# 2. Backup current DB (just in case)
cp /home/headlinesift/headlinesift/data/headlinesift.db \
   /home/headlinesift/headlinesift/data/headlinesift.db.before-restore

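# 3. Remove stale WAL/SHM files so they are not applied to the restored DB
rm -f /home/headlinesift/headlinesift/data/headlinesift.db-wal \
      /home/headlinesift/headlinesift/data/headlinesift.db-shm

# 4. Copy backup over current DB
cp /home/headlinesift/headlinesift/backups/backup-YYYY-MM-DD-HHmmss.db \
   /home/headlinesift/headlinesift/data/headlinesift.db

# 5. Start the app
pm2 start headlinesift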
```

The admin UI lists the backup files but cannot trigger a restore, which prevents accidental data loss.

### Backup Verification

After each backup:
1. Open backup file
2. Run `PRAGMA integrity_check;`
3. Log result
4. If integrity check fails, log error and alert admin

---

## 23. SQLite Limitations and Write Conflict Avoidance

### Key SQLite Limitations for This Application

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| Single writer at a time | Concurrent write operations queue up | Serialize all writes through app; cron scripts write one at a time |
| No row-level locking | Entire database is locked during writes | Keep transactions short; use WAL mode |
| Busy timeout on contention | Write fails if lock held too long | Set `busy_timeout=5000` (5-second wait) |
| Limited concurrency | Many simultaneous readers + one writer can cause "database is locked" | READ operations are fine in WAL mode; WRITE operations are serial |
| No built-in replication | Single point of failure | Daily backups mitigate data loss risk |
| File size limit | Theoretical 281 TB limit; practical performance degrades above ~1 GB | MVP volume is tiny; cleanup script prunes old data |
| No geographic distribution | Single server | Fine for MVP; future migration path documented |

### WAL Mode Configuration

```prisma
// In prisma/schema.prisma
datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

// Applied via migration or startup:
// PRAGMA journal_mode=WAL;
// PRAGMA busy_timeout=5000;
// PRAGMA foreign_keys=ON;
// PRAGMA synchronous=NORMAL;
```

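Of these pragmas, only `journal_mode=WAL` is persisted in the database file. `busy_timeout`, `foreign_keys`, and `synchronous` are per-connection settings, so they must be applied each time a connection is opened (for example, at startup in `lib/prisma.ts`) rather than only in migration SQL.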

### Write Serialization Strategy

Since all writes go through the same SQLite file, we ensure:

1. **Cron scripts never overlap**: Staggered cron times (Section 13) with 3-minute gaps
2. **Single script instance**: Each script checks for running jobs of its type before starting
3. **Short transactions**: Keep write transactions as brief as possible
4. **Prisma connection pooling**: Use a single Prisma client instance (`lib/prisma.ts` singleton)
5. **Retry with backoff**: If a write fails due to locking, retry up to 3 times with exponential backoff (100ms, 200ms, 400ms)
6. **WAL checkpoint**: Periodic WAL checkpoint via cleanup script to prevent WAL file from growing too large

### Prisma Client Singleton

```typescript
// src/lib/prisma.ts
import { PrismaClient } from '@prisma/client'

const globalForPrisma = globalThis as unknown as {
  prisma: PrismaClient | undefined
}

export const prisma = globalForPrisma.prisma ?? new PrismaClient({
  log: process.env.NODE_ENV === 'development' ? ['query', 'error', 'warn'] : ['error'],
})

if (process.env.NODE_ENV !== 'production') globalForPrisma.prisma = prisma
```

### Write Conflict Handling in Services

```typescript
// Utility for write operations with retry.
// Makes one initial attempt plus up to `maxRetries` retries on lock errors,
// waiting 100ms, 200ms, then 400ms between attempts.
async function withWriteRetry<T>(operation: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation()
    } catch (error) {
      const isLockError =
        error instanceof Error &&
        (error.message.includes('SQLITE_BUSY') || error.message.includes('database is locked'))
      // Rethrow non-lock errors immediately, and lock errors once retries are exhausted
      if (!isLockError || attempt >= maxRetries) throw error
      const delay = 100 * Math.pow(2, attempt) // 100, 200, 400ms
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
}
```

### Expected MVP Volume

| Table | Estimated Rows (3 months) |
|-------|--------------------------|
| countries | 10 |
| categories | 15 |
| sources | 200 |
| source_mappings | 1000 |
| raw_articles | 50,000 |
| story_clusters | 5,000 |
| story_articles | 15,000 |
| ai_story_analysis | 3,000 |
| fetch_logs | 20,000 |
| jobs | 2,000 |
| settings | 50 |
| sessions | 10 |

**Total estimated rows: ~96,000** — well within SQLite's comfort zone.

### Database File Size Estimate

At an average of ~500 bytes per row (including indexes), ~96,000 rows come to roughly 48 MB, so the database should stay within approximately **50–100 MB** after 3 months, allowing headroom for longer text fields. Cleanup scripts keep this in check.

---

## 24. Future Migration Path to PostgreSQL

### When to Migrate

Consider migration when:
- Write contention becomes a real bottleneck (many concurrent admin users + cron jobs)
- Database size exceeds 500 MB with active cleanup
- Need for replication or high availability
- Multiple application servers need to share one database
- Team grows beyond solo admin

### Migration Strategy

The codebase is designed to make migration straightforward:

1. **Prisma ORM as Abstraction Layer**: All database access goes through Prisma. Changing from SQLite to PostgreSQL requires:
   - Update `prisma/schema.prisma`: change provider from `"sqlite"` to `"postgresql"`
   - Update `DATABASE_URL` in `.env`
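   - Run `prisma migrate dev` to generate PostgreSQL migrations (the existing SQLite migration history is provider-specific, so it must be archived and a fresh baseline created first)
   - No application code changes should be needed for standard Prisma Client queries (Prisma abstracts the dialect)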

2. **Schema Compatibility**: The MVP schema avoids SQLite-specific features:
   - No `AUTOINCREMENT` (uses `@default(autoincrement())` which Prisma maps correctly)
   - No SQLite-specific functions in queries
   - No raw SQL queries (uses Prisma Client for all operations)
   - DateTime fields are Prisma `DateTime` (mapped correctly to both SQLite and PostgreSQL)

3. **Data Migration**:
   ```bash
   # Export SQLite to SQL dump
   sqlite3 headlinesift.db .dump > export.sql
   
   # Transform for PostgreSQL (minor syntax differences)
   # ... or use a tool like pgloader
   
   # Import to PostgreSQL
   psql -d headlinesift -f export.sql
   ```

4. **Schema Changes for PostgreSQL**:
   - `String` fields in SQLite have no length limit; PostgreSQL `TEXT` is equivalent
   - `Boolean` in SQLite is integer 0/1; Prisma handles this transparently
   - `DateTime` in SQLite is stored as string/numeric; Prisma handles this
   - `Float` is the same in both

5. **What Would Change**:
   - Replace SQLite backup strategy with `pg_dump`
   - Replace WAL pragmas with PostgreSQL config
   - Can add Redis/BullMQ for job queue (the job system was designed to be replaceable)
   - Can add connection pooling for concurrent access

### Migration Cost Estimate

| Item | Effort |
|------|--------|
| Provision PostgreSQL (Supabase/RDS) | 1 hour |
| Update Prisma schema and regenerate | 30 minutes |
| Migrate data | 1–2 hours |
| Update .env and deploy | 30 minutes |
| Testing | 2–4 hours |
| **Total** | **~1 day** |

---

## 25. Security Considerations

### Admin Authentication

- Single admin password stored as bcrypt hash in `settings` table (key: `admin_password_hash`)
- Login creates a random 256-bit session token (from a cryptographically secure generator, not a hash of predictable input), stored in `sessions` table with 24-hour expiry
- Session token set as httpOnly, secure, sameSite=strict cookie
- Rate limiting: max 5 login attempts per 15-minute window (tracked in memory or sessions table)
- No password reset for MVP (admin can manually update via settings or direct DB access)
- Admin session check via Next.js middleware on all `/admin/*` routes (except `/admin/login`)
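
The in-memory variant of the login rate limit can be sketched as a sliding window keyed by client identifier. This is a sketch under stated assumptions: the `LoginRateLimiter` class is hypothetical, the key (for example, an IP address) is chosen by the caller, and state is lost whenever the process restarts.

```typescript
// Sliding-window limiter: at most maxAttempts per windowMs for each key.
class LoginRateLimiter {
  private readonly attempts = new Map<string, number[]>();
  private readonly maxAttempts: number;
  private readonly windowMs: number;

  constructor(maxAttempts = 5, windowMs = 15 * 60 * 1000) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
  }

  // Records an attempt and returns true if it is allowed, false if blocked.
  allow(key: string, now: number = Date.now()): boolean {
    const recent = (this.attempts.get(key) ?? []).filter((t) => now - t < this.windowMs);
    if (recent.length >= this.maxAttempts) {
      this.attempts.set(key, recent);
      return false;
    }
    recent.push(now);
    this.attempts.set(key, recent);
    return true;
  }
}
```

Blocked attempts are not recorded, so a client that keeps retrying is unblocked once its five earlier attempts age out of the window.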

### API Security

- All admin API routes check for valid session token
- Zod validation on ALL request inputs (body, query params, route params)
- CSRF protection via custom header check (X-CSRF-Token) on state-changing admin API routes
- Public API routes (`/api/stories`) are read-only, no authentication required
- No API keys exposed to client-side code
- `.env` file has 600 permissions, outside public_html

### Data Security

- SQLite database outside public_html → not web-accessible
- Backup files outside public_html → not web-accessible
- Log files outside public_html → not web-accessible
- API keys stored ONLY in `.env` (never in database)
- `apiKeyRef` in source records is just a reference name (e.g., "NEWSAPI_KEY"), not the key itself

### Web Security Headers

Configured via Next.js `headers()` in `next.config.ts` or middleware:

```
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
Referrer-Policy: strict-origin-when-cross-origin
Content-Security-Policy: default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; connect-src 'self'
Strict-Transport-Security: max-age=31536000; includeSubDomains
```

### Additional Protections

- PM2 runs Next.js as the cPanel user (not root)
- cPanel user has limited filesystem permissions
- Apache reverse proxy limits direct access to Node.js port
- Database file permissions: 640 (owner read/write, group read)
- `.env` file permissions: 600 (owner read/write only)
- SSH key authentication for server access (disable password SSH)

---

## 26. Legal/Disclaimer Considerations

### Required Pages

| Page | Content |
|------|---------|
| About | What HeadlineSift is, how it works, source methodology |
| Contact | Contact form or email for inquiries, content removal requests |
| Privacy Policy | Data collection (minimal — no user accounts, no tracking cookies for MVP), hosting info, log data |
| Terms of Use | Acceptable use, intellectual property, limitation of liability |
| Disclaimer | AI limitations, not professional advice, source accuracy |
| Source Attribution Policy | How sources are credited, link-back policy |
| Content Removal Request | Process for publishers to request content/attribution changes |
| AI Summary Disclaimer | Clear disclosure that summaries are AI-generated and may contain errors |

### AI Disclaimer Text (Required on Every Story Card)

```
AI-Generated Summary
This summary and analysis were generated by AI based on available source material.
It may contain inaccuracies. Always read the original sources before making decisions.
```

### Category-Specific Disclaimer Text (On Story Cards)

#### Health
```
⚠ Health Information Disclaimer
This content is for informational purposes only and is not a substitute for professional
medical advice, diagnosis, or treatment. Always seek the advice of your physician or
other qualified health provider.
```

#### Finance
```
⚠ Financial Information Disclaimer
This content is for informational purposes only and does not constitute financial,
investment, or trading advice. Past performance is not indicative of future results.
Consult a qualified financial advisor before making investment decisions.
```

#### Education
```
⚠ Education Information Disclaimer
This content is for informational purposes. Always verify deadlines, exam dates, and
admission requirements with official education board or institution websites.
```

#### Breaking News
```
⚠ Developing Story
This is a developing story. Details may change as more information becomes available.
Check original sources for the latest updates.
```

### Source Attribution Requirements

- Every story card MUST list all source names
- Every story card MUST link to at least the primary (canonical) original source
- Source names link to the original article URL (not the source homepage)
- "Read original" button links to the canonical source article
- HeadlineSift does NOT reproduce full article content — only AI-generated summaries

### Fair Use / Content Limits

- Only headlines and short snippets (max 200 chars) are stored from original sources
- Full article text is NEVER stored or displayed
- AI summaries are original text generated from the snippets
- Always attribute and link back

### Future Legal Considerations (Post-MVP)

- DMCA compliance for US hosting
- GDPR compliance if EU users are targeted
- Cookie consent if analytics are added
- Publisher opt-out mechanism
- Commercial use licenses for news APIs

---

## 27. MVP Milestones

### Milestone 1: Project Setup & Database (Week 1)

**Goal:** Project initialized, database running, admin can log in and manage reference data.

- [ ] Initialize Next.js project with TypeScript and Tailwind CSS
- [ ] Set up Prisma with SQLite provider
- [ ] Design and create full database schema (all 12 tables)
- [ ] Run initial migration
- [ ] Set up directory structure (`data/`, `backups/`, `logs/`)
- [ ] Create `prisma.ts` singleton with WAL mode pragmas
- [ ] Create `.env` with DATABASE_URL
- [ ] Create seed script with 5 countries, 7 categories, 100 sources, default settings
- [ ] Run seed
- [ ] Build admin login page and API
- [ ] Implement session-based auth middleware
- [ ] Build admin layout with sidebar navigation
- [ ] Build Countries CRUD (page + API)
- [ ] Build Categories CRUD (page + API)
- [ ] Build Settings page (read/write all settings)
- [ ] Build reusable DataTable component
- [ ] Set up PM2 configuration
- [ ] Configure cPanel reverse proxy or Node.js app
- [ ] Verify app runs on the domain with PM2

**Deliverable:** Admin can log in, manage countries, categories, and settings via the admin panel.

### Milestone 2: Source Management & Fetching (Week 2)

**Goal:** Admin manages sources; system fetches articles from RSS/API feeds.

- [ ] Build Sources CRUD (page + API)
- [ ] Build Source Mappings CRUD + bulk create (page + API)
- [ ] Build RSS feed parser (`providers/news/rss.ts`)
- [ ] Build REST API news source (`providers/news/api.ts`)
- [ ] Build article normalizer (hash computation, field normalization)
- [ ] Build exact deduplication (Layer 1: URL/title/content hash)
- [ ] Build `fetch-all.ts` script
- [ ] Build `fetch-source.ts` script
- [ ] Integrate `scripts/fetch-all.ts` with the job system (create, update, and complete job records)
- [ ] Build fetch log creation and update logic
- [ ] Build Fetch Logs viewer page
- [ ] Build Jobs list page
- [ ] Configure cPanel cron for fetch-all (every 30 min)
- [ ] Test: manual fetch from admin triggers successful RSS fetch
- [ ] Test: cron fetch works end-to-end

**Deliverable:** System fetches articles from active RSS/API sources on schedule. Admin can view articles, fetch logs, and job status.
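
The normalizer and Layer 1 deduplication tasks in this milestone could look roughly like the sketch below. It covers only the URL and title hashes (the content hash would follow the same pattern). The field names, the list of tracking parameters, and the rule that a match on either hash counts as a duplicate are assumptions; Section 14 remains the authority on the actual strategy.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input and output shapes (not the spec's schema).
interface RawArticle {
  url: string;
  title: string;
}
interface ArticleHashes {
  urlHash: string;
  titleHash: string;
}

function sha256(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// Normalize a URL so trivially different links produce the same hash.
function normalizeUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = "";
  // Collect keys first, then delete, to avoid mutating while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    // Assumed list of tracking parameters.
    if (key.startsWith("utm_") || key === "fbclid" || key === "gclid") {
      url.searchParams.delete(key);
    }
  }
  url.searchParams.sort();
  const path = url.pathname.replace(/\/+$/, "") || "/";
  return `${url.protocol}//${url.host}${path}${url.search}`;
}

// Lowercase and collapse whitespace; punctuation is left untouched here.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/\s+/g, " ").trim();
}

function computeHashes(article: RawArticle): ArticleHashes {
  return {
    urlHash: sha256(normalizeUrl(article.url)),
    titleHash: sha256(normalizeTitle(article.title)),
  };
}

// Layer 1 check: duplicate if either hash was already seen (an assumption).
// The hashes are recorded in `seen` as a side effect.
function isExactDuplicate(h: ArticleHashes, seen: Set<string>): boolean {
  const keys = [`u:${h.urlHash}`, `t:${h.titleHash}`];
  const duplicate = keys.some((k) => seen.has(k));
  keys.forEach((k) => seen.add(k));
  return duplicate;
}
```

In production the `seen` set would more likely be a unique index on the hash columns, so that duplicates are rejected by the database rather than held in memory.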

### Milestone 3: Clustering & Ranking (Week 3)

**Goal:** Duplicates removed, stories grouped, clusters ranked.

- [ ] Build TF-IDF vectorization utility
- [ ] Build cosine similarity computation
- [ ] Build clustering service (`clusterer.ts`)
- [ ] Build `cluster-stories.ts` script
- [ ] Build near-duplicate detection (Layer 2: title similarity)
- [ ] Build ranking engine (`ranker.ts`) with all scoring factors
- [ ] Build `rank-stories.ts` script
- [ ] Build Ranking Rules config page (read/edit weights)
- [ ] Build Story Clusters list page with filters
- [ ] Build Story Cluster detail page with rank breakdown
- [ ] Configure cPanel cron for cluster and rank scripts
- [ ] Test: new articles are clustered correctly
- [ ] Test: clusters are ranked with visible score breakdowns
- [ ] Test: ranking weights can be changed and take effect

**Deliverable:** Articles are grouped into story clusters and ranked. Admin can view clusters and understand why each is scored as it is.
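
For orientation, the TF-IDF and cosine similarity utilities listed in this milestone might be sketched as follows. This is an illustration under stated assumptions, not the clustering design from Section 15: the tokenizer handles only ASCII letters and digits (so Hindi or Japanese titles would need a different tokenizer), and the smoothed IDF formula is one common variant among several.

```typescript
type Vector = Map<string, number>;

// Assumed tokenizer: lowercase, split on non-alphanumerics, drop 1-char tokens.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 1);
}

// Builds one TF-IDF vector per document.
// IDF variant used here: ln((1 + N) / (1 + df)) + 1.
function buildTfIdf(docs: string[]): Vector[] {
  const tokenized = docs.map(tokenize);
  const docFreq = new Map<string, number>();
  for (const tokens of tokenized) {
    new Set(tokens).forEach((term) => {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    });
  }
  const n = docs.length;
  return tokenized.map((tokens) => {
    const vec: Vector = new Map();
    // Raw term counts first.
    for (const term of tokens) vec.set(term, (vec.get(term) ?? 0) + 1);
    // Then weight each count by the term's IDF.
    vec.forEach((tf, term) => {
      const idf = Math.log((1 + n) / (1 + (docFreq.get(term) ?? 0))) + 1;
      vec.set(term, tf * idf);
    });
    return vec;
  });
}

// Cosine similarity of two sparse vectors; 0 when either vector is empty.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

A clustering pass would compare each new article's vector against existing cluster representatives and attach it to the best match when the similarity exceeds the configured threshold.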

### Milestone 4: AI Analysis & Moderation (Week 4)

**Goal:** AI generates analysis; admin reviews and publishes stories.

- [ ] Build AI provider abstraction (`providers/ai/index.ts`)
- [ ] Build Anthropic provider
- [ ] Build OpenAI provider
- [ ] Build AI prompt builder with category safety rules
- [ ] Build AI result validator (Zod schema)
- [ ] Build `ai-analyze.ts` script
- [ ] Build AI cost control (batch limits, caching, skip-if-unchanged)
- [ ] Build Review Queue page
- [ ] Build cluster approve/reject/hide/publish actions
- [ ] Build AI analysis viewer and inline editor
- [ ] Build batch approve/reject
- [ ] Build auto-publish logic (for low-risk categories)
- [ ] Build review queue filters (by category, confidence, impact)
- [ ] Configure cPanel cron for AI analysis script
- [ ] Test: AI generates analysis for top clusters
- [ ] Test: admin can review, edit, approve, and reject
- [ ] Test: auto-publish works for configured categories

**Deliverable:** AI generates analysis for top clusters. Admin can review and moderate all content before it goes public.

### Milestone 5: Public Launch (Week 5)

**Goal:** Public page is live with clean UI and production readiness.

- [ ] Build public homepage (`/`) with story cards
- [ ] Build FilterBar component (country, category, sort, time, confidence, impact)
- [ ] Build StoryCard component with all fields
- [ ] Build ImpactBadge, ConfidenceBadge, SourceList components
- [ ] Build LoadingSkeleton, EmptyState, ErrorState components
- [ ] Build public API route (`/api/stories`) with filtering
- [ ] Add SEO meta tags
- [ ] Build legal pages (About, Contact, Privacy, Terms, Disclaimer)
- [ ] Add AI disclaimer to every story card
- [ ] Add category-specific disclaimers
- [ ] Build footer with legal links
- [ ] Responsive design testing (mobile, tablet, desktop)
- [ ] Build CSV export (admin)
- [ ] Build CSV import with preview (admin)
- [ ] Build SQLite backup page (admin)
- [ ] Build `backup.ts` script
- [ ] Build `cleanup.ts` script
- [ ] Build `health-check.ts` script
- [ ] Configure all cPanel cron jobs
- [ ] Performance testing (page load < 2s)
- [ ] Security review
- [ ] Final deployment and smoke testing

**Deliverable:** HeadlineSift.com is live. Public visitors can browse filtered, AI-analyzed headlines. Admin has full control.

---

## 28. Folder Structure Recommendation

See Section 9 for the complete directory tree. Summary of key paths:

| Path | Purpose |
|------|---------|
| `/home/headlinesift/headlinesift/` | Application root |
| `/home/headlinesift/headlinesift/data/headlinesift.db` | SQLite database |
| `/home/headlinesift/headlinesift/backups/` | Database backups |
| `/home/headlinesift/headlinesift/logs/` | Application logs |
| `/home/headlinesift/headlinesift/.env` | Environment variables |
| `/home/headlinesift/headlinesift/src/` | Application source code |
| `/home/headlinesift/headlinesift/scripts/` | CLI scripts for cron |
| `/home/headlinesift/headlinesift/prisma/` | Prisma schema and migrations |
| `/home/headlinesift/public_html/` | Apache document root (minimal) |

---

## 29. Environment Variables

### `.env` File

```bash
# Database
DATABASE_URL="file:/home/headlinesift/headlinesift/data/headlinesift.db"

# Application
NODE_ENV=production
NEXT_PUBLIC_SITE_URL=https://headlinesift.com
NEXT_PUBLIC_SITE_NAME=HeadlineSift

# Admin (initial password set via seed; bcrypt hash stored in DB)
# NEXTAUTH_SECRET is used for session token signing
NEXTAUTH_SECRET=<random-64-char-string>

# AI Providers (at least one required)
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_AI_API_KEY=...

# Default AI provider: anthropic | openai | google
AI_PROVIDER=anthropic
AI_MODEL=claude-sonnet-4-6

# News API Keys (optional — for API-type sources)
NEWSAPI_KEY=...
MEDIASTACK_KEY=...
GNEWS_KEY=...

# Logging
LOG_LEVEL=info

# PM2
PM2_HOME=/home/headlinesift/.pm2
```

### `.env.example`

Commit a template file, `.env.example`, to the repository. It lists the same keys as `.env` with placeholder values and contains no secrets.
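
A startup check can catch a missing or inconsistent variable before the app or a cron script runs. The sketch below is dependency-free and illustrative only: which variables are treated as required, and the rule that the selected `AI_PROVIDER` must have its matching key, are assumptions layered on the `.env` listing above. A Zod schema would be the more natural fit for this stack.

```typescript
type Env = Record<string, string | undefined>;

// Provider name -> API key variable, taken from the .env listing above.
const PROVIDER_KEYS: Record<string, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  google: "GOOGLE_AI_API_KEY",
};

// Returns a list of configuration errors; an empty list means the env looks usable.
function validateEnv(env: Env): string[] {
  const errors: string[] = [];
  const isSet = (name: string): boolean => (env[name] ?? "").trim().length > 0;

  // Assumed set of always-required variables.
  for (const name of ["DATABASE_URL", "NEXTAUTH_SECRET", "NEXT_PUBLIC_SITE_URL"]) {
    if (!isSet(name)) errors.push(`${name} is required`);
  }

  const dbUrl = (env["DATABASE_URL"] ?? "").trim();
  if (dbUrl !== "" && !dbUrl.startsWith("file:")) {
    errors.push("DATABASE_URL must use the file: scheme for SQLite");
  }

  const provider = (env["AI_PROVIDER"] ?? "").trim();
  if (!Object.prototype.hasOwnProperty.call(PROVIDER_KEYS, provider)) {
    errors.push(`AI_PROVIDER must be one of: ${Object.keys(PROVIDER_KEYS).join(", ")}`);
  } else {
    const keyName = PROVIDER_KEYS[provider] as string;
    if (!isSet(keyName)) {
      errors.push(`${keyName} is required when AI_PROVIDER=${provider}`);
    }
  }
  return errors;
}
```

Calling `validateEnv(process.env)` at the top of each entry point, and exiting with the error list when it is non-empty, would surface misconfiguration in the logs immediately.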

---

## 30. Acceptance Criteria

### The MVP is ready when ALL of the following are true:

#### Admin
- [ ] Admin can log in with a password
- [ ] Admin session expires after 24 hours
- [ ] Admin can add, edit, activate, and deactivate countries
- [ ] Admin can add, edit, activate, and deactivate categories
- [ ] Admin can add, edit, activate, and deactivate sources
- [ ] Admin can map sources to countries and categories
- [ ] Admin can trigger a fetch for an individual source
- [ ] Admin can view raw fetched articles with filters
- [ ] System fetches articles from RSS and API sources on cron schedule
- [ ] System removes exact duplicate articles before saving
- [ ] System detects near-duplicate titles
- [ ] System groups similar articles into story clusters
- [ ] System ranks clusters with a visible score breakdown
- [ ] System selects top N stories per category/country
- [ ] AI generates summary, why it matters, positive impact, negative impact, affected groups, impact level, confidence level, risk warning, and display headline
- [ ] AI analysis is cached and not regenerated unnecessarily
- [ ] AI analysis respects category-specific safety rules
- [ ] Admin can review, edit, approve, reject, and hide story clusters
- [ ] Admin can batch approve or reject stories
- [ ] Auto-publish works for configured low-risk categories
- [ ] Admin can view fetch logs with errors and success metrics
- [ ] Admin can view job status and progress
- [ ] Admin can configure ranking rule weights
- [ ] Admin can export sources, categories, and mappings as CSV
- [ ] Admin can import sources from CSV with validation and preview
- [ ] Admin can create and download SQLite backups
- [ ] Admin dashboard shows accurate metrics
- [ ] Source health monitoring auto-pauses failing sources
- [ ] Cleanup script runs daily and prunes old data
- [ ] Source "Test Feed" button verifies RSS/API endpoint

#### Public
- [ ] Public homepage loads in < 2 seconds
- [ ] Public page shows published story cards with all fields
- [ ] Visitors can filter by country
- [ ] Visitors can filter by category
- [ ] Visitors can sort by Top Ranked, Latest, Most Covered, High Impact
- [ ] Visitors can filter by time window
- [ ] Visitors can filter by confidence and impact level
- [ ] Each story card shows: display headline, AI summary, why it matters, positive impact, negative impact, impact level, confidence level, source count, source names, last updated, "Read original" link
- [ ] Each story card shows AI disclaimer
- [ ] Health, finance, education, and breaking news stories show category-specific disclaimers
- [ ] "Read original" links open source articles in a new tab
- [ ] Legal pages exist: About, Contact, Privacy, Terms, Disclaimer
- [ ] Responsive design works on mobile (375px), tablet (768px), and desktop (1280px+)
- [ ] Empty state shows helpful message when no stories match filters
- [ ] Error state shows retry option when loading fails
- [ ] SEO meta tags are present and dynamic

#### System
- [ ] SQLite runs in WAL mode
- [ ] Database is outside public_html
- [ ] `.env` is outside public_html with 600 permissions
- [ ] Backups are outside public_html
- [ ] Logs are outside public_html
- [ ] Write conflicts are handled with retry logic
- [ ] Daily backup runs via cron
- [ ] Daily cleanup runs via cron
- [ ] Source health check runs every 6 hours
- [ ] PM2 auto-restarts the app on crash
- [ ] PM2 auto-starts on server reboot
- [ ] Cron jobs are configured and running
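- [ ] No external infrastructure services are required: the database, job system, and scheduler all run on the VPS (only AI provider and news feed/API calls leave the server)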
- [ ] All API routes validate input with Zod
- [ ] Admin routes are protected by session auth
- [ ] Rate limiting on login attempts
- [ ] Security headers are set
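
The write-conflict criterion above could be met with a small retry wrapper such as the sketch below. The function name, retry count, and delays are assumptions, and so is the error matching: it looks for `SQLITE_BUSY` or "database is locked" in the error message, and how the database client actually surfaces these errors should be confirmed before relying on it.

```typescript
interface RetryOptions {
  retries: number;     // maximum number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles each time
}

// Assumed detection rule: match SQLite's busy/locked wording in the message.
function isBusyError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /SQLITE_BUSY|database is locked/i.test(message);
}

// Runs `op`, retrying with exponential backoff only on busy/locked errors.
// Any other error, or exhausting the retries, rethrows the last error.
async function withWriteRetry<T>(
  op: () => Promise<T>,
  opts: RetryOptions = { retries: 5, baseDelayMs: 50 },
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (!isBusyError(err) || attempt >= opts.retries) throw err;
      const delay = opts.baseDelayMs * Math.pow(2, attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Cron scripts and API routes would wrap each write in `withWriteRetry(() => ...)`. Constraint violations and other non-transient errors pass straight through, so real bugs are not hidden by retries.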

---

## Appendix A: Category Configuration Reference

| Category | Slug | Level | AI Safety | Auto-Publish | Max Stories | Fetch Freq |
|----------|------|-------|-----------|-------------|-------------|------------|
| Technology | technology | global | low | true | 50 | 30–60 min |
| Health & Wellness | health | global | critical | false | 50 | 1–3 hours |
| Science & Space | science | global | medium | true | 50 | 3–6 hours |
| Finance & Stock Market | finance | country | high | false | 50 | 5–15 min (market hours) |
| Education & Careers | education | country | high | false | 50 | 1–6 hours |
| Business & Economy | business | country | medium | false | 50 | 30 min |
| Breaking News | breaking | country | critical | false | 50 | 5–10 min |
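
As an illustration of how this table might drive the auto-publish step, the sketch below encodes the slug, AI safety level, and auto-publish columns and routes a cluster either to publication or to the review queue. The type names and the extra guard (never auto-publish `high` or `critical` categories even if the flag is set by mistake) are assumptions, not requirements stated in this specification.

```typescript
type AiSafety = "low" | "medium" | "high" | "critical";

interface CategoryConfig {
  slug: string;
  aiSafety: AiSafety;
  autoPublish: boolean;
}

// Values copied from the table above.
const CATEGORIES: CategoryConfig[] = [
  { slug: "technology", aiSafety: "low", autoPublish: true },
  { slug: "health", aiSafety: "critical", autoPublish: false },
  { slug: "science", aiSafety: "medium", autoPublish: true },
  { slug: "finance", aiSafety: "high", autoPublish: false },
  { slug: "education", aiSafety: "high", autoPublish: false },
  { slug: "business", aiSafety: "medium", autoPublish: false },
  { slug: "breaking", aiSafety: "critical", autoPublish: false },
];

type PublishDecision = "auto_publish" | "review_queue";

function decidePublish(
  slug: string,
  categories: CategoryConfig[] = CATEGORIES,
): PublishDecision {
  const category = categories.find((c) => c.slug === slug);
  // Unknown categories always go to a human.
  if (!category) return "review_queue";
  // Assumed safety net: the flag alone is not enough for high-risk categories.
  const lowRisk = category.aiSafety === "low" || category.aiSafety === "medium";
  return category.autoPublish && lowRisk ? "auto_publish" : "review_queue";
}
```

With the defaults above, only `technology` and `science` clusters bypass the review queue.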

## Appendix B: Country Configuration Reference

| Name | Code | Region | Language | Is Global Option |
|------|------|--------|----------|-----------------|
| Global | GLOBAL | Global | en | true |
| India | IN | Asia | en,hi | false |
| United States | US | North America | en | false |
| United Kingdom | GB | Europe | en | false |
| Japan | JP | Asia | ja,en | false |

## Appendix C: Default Source Trust Scores

| Source Type | Default Trust |
|-------------|--------------|
| Official government/agency | 10 |
| Central bank | 10 |
| Stock exchange | 10 |
| Regulator | 10 |
| Major wire service (AP, Reuters, AFP) | 9 |
| Major national newspaper | 8 |
| Major TV/news network | 8 |
| Specialist publication | 7 |
| Niche news website | 5 |
| Blog / independent | 3 |
| Unknown | 2 |
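
These defaults could be applied when a source is created or imported without an explicit trust score. The sketch below is illustrative: the snake_case keys are hypothetical identifiers for the source types in the table, and only the numeric values come from it.

```typescript
// Hypothetical keys; scores copied from the table above.
const DEFAULT_TRUST: Record<string, number> = {
  official_government: 10,
  central_bank: 10,
  stock_exchange: 10,
  regulator: 10,
  wire_service: 9,
  national_newspaper: 8,
  tv_network: 8,
  specialist_publication: 7,
  niche_website: 5,
  blog_independent: 3,
  unknown: 2,
};

const UNKNOWN_TRUST = 2;

// Returns the default score for a source type, falling back to "Unknown".
function defaultTrustScore(sourceType: string): number {
  const key = sourceType.trim().toLowerCase();
  // Own-property check so inherited names like "constructor" do not match.
  if (Object.prototype.hasOwnProperty.call(DEFAULT_TRUST, key)) {
    return DEFAULT_TRUST[key] as number;
  }
  return UNKNOWN_TRUST;
}
```

An admin-entered score would still take precedence; the lookup only fills the gap when none is provided.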

---

*End of Technical Product Specification — HeadlineSift.com MVP v1.0.0*