Vector Database Semantic Linking System
Complete Technical Build Specification
📋 Project Overview
This document outlines everything required to build a production-ready vector database semantic linking system for WordPress law firm websites. Estimated timeline: 6-8 weeks for MVP, 12 weeks for full production deployment.
1️⃣ Technical Stack Requirements
Core Technologies
WordPress Environment
- WordPress: 6.0+ (tested up to latest)
- PHP: 8.0+ (for modern syntax and performance)
- MySQL/MariaDB: 5.7+ / 10.3+ (for metadata storage)
- Server: Apache (with mod_rewrite) or Nginx
- Memory: 256MB PHP memory minimum (512MB recommended)
Vector Database (Choose One)
Option A: Pinecone (Recommended for Quick Start)
- Free tier: 100K vectors, 1 index
- Starter: $70/month (5M vectors, 1 pod)
- No infrastructure management required
- Simple REST API
Option B: PostgreSQL + pgvector
- PostgreSQL 12+ with pgvector extension
- Self-hosted or managed (AWS RDS, Supabase, etc.)
- More control, lower long-term costs
- Requires database administration
Option C: Weaviate
- Docker or cloud-hosted
- Built-in OpenAI integration
- Hybrid search capabilities
- More complex but very powerful
Embedding API
- OpenAI API: text-embedding-3-large or text-embedding-3-small
- Cost: ~$0.13 per 1M tokens (3-large) or $0.02 per 1M tokens (3-small)
- Dimensions: 3072 (3-large) or 1536 (3-small)
- Alternative: Cohere, Voyage AI, or open-source models
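As a rough worked example of the pricing above, embedding cost scales linearly with token count. The sketch below assumes about 2,000 tokens per page (roughly a 1,500-word article), which is an illustrative figure rather than a measurement, and `sl_estimate_embedding_cost` is a hypothetical helper name.

```php
<?php
// Per-million-token prices quoted in this specification (USD).
const SL_PRICE_PER_MILLION = [
    'text-embedding-3-large' => 0.13,
    'text-embedding-3-small' => 0.02,
];

/**
 * Estimate the one-off cost of embedding a site.
 * $tokens_per_page is an assumption; measure real content for a better figure.
 */
function sl_estimate_embedding_cost(int $pages, int $tokens_per_page, string $model): float
{
    if (!array_key_exists($model, SL_PRICE_PER_MILLION)) {
        throw new InvalidArgumentException('Unknown model: ' . $model);
    }
    $total_tokens = $pages * $tokens_per_page;
    return $total_tokens / 1000000 * SL_PRICE_PER_MILLION[$model];
}

// 1,000 pages at an assumed 2,000 tokens each
printf('3-large: $%.2f' . PHP_EOL, sl_estimate_embedding_cost(1000, 2000, 'text-embedding-3-large'));
printf('3-small: $%.2f' . PHP_EOL, sl_estimate_embedding_cost(1000, 2000, 'text-embedding-3-small'));
```

Under these assumptions a 1,000-page site costs well under a dollar to embed with either model, so re-embedding after edits is unlikely to dominate the budget.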
Development Tools
- Git: Version control
- Composer: PHP dependency management
- npm/yarn: For any JavaScript components
- WP-CLI: WordPress command-line interface (optional but helpful)
- Postman/Insomnia: API testing
2️⃣ Required Development Skills & Resources
Team Composition
| Role | Skills Required | Time Commitment |
|---|---|---|
| Backend Developer | PHP, WordPress plugin development, REST API integration, database design | 4-6 weeks full-time |
| Database Engineer | PostgreSQL or vector DB experience, query optimization, indexing strategies | 2-3 weeks part-time |
| Frontend Developer | JavaScript, React (optional), WordPress admin UI, AJAX | 2-3 weeks part-time |
| DevOps Engineer | Server configuration, database hosting, API security, monitoring | 1-2 weeks setup + ongoing |
| QA/Testing | WordPress testing, API testing, performance benchmarking | 2-3 weeks part-time |
💡 Alternative: A single senior full-stack developer with WordPress and API experience can handle this project solo over 8-12 weeks, but team collaboration speeds development significantly.
3️⃣ WordPress Plugin Architecture
Plugin Structure
semantic-linking/
├── semantic-linking.php              # Main plugin file
├── includes/
│   ├── class-vector-db.php           # Vector database abstraction layer
│   ├── class-embedding.php           # Embedding API handler
│   ├── class-content-processor.php   # Extract & prepare content
│   ├── class-recommender.php         # Recommendation engine
│   └── class-admin-ui.php            # WordPress admin interface
├── admin/
│   ├── css/
│   │   └── admin-styles.css
│   ├── js/
│   │   └── admin-scripts.js
│   └── views/
│       ├── dashboard.php
│       └── recommendations.php
├── api/
│   └── class-rest-api.php            # REST API endpoints
├── cron/
│   └── class-batch-processor.php     # Background processing
├── config/
│   └── settings.php                  # Configuration constants
└── vendor/                           # Composer dependencies
    └── autoload.php
Core Plugin Components
1. Content Extraction Hooks
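// Hook into post save and delete. save_post fires on both create and update,
// so a separate post_updated hook would embed every edit twice.
add_action('save_post', 'sl_process_content', 10, 3);
add_action('before_delete_post', 'sl_delete_vector', 10, 1);

function sl_process_content($post_id, $post, $update) {
    // Skip autosaves, revisions, and anything that is not published
    if (wp_is_post_autosave($post_id) || wp_is_post_revision($post_id) || $post->post_status !== 'publish') {
        return;
    }
    // Extract title, meta description, content
    // Generate embedding
    // Store in vector DB
}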
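The extraction step can be sketched in plain PHP, independent of WordPress. `sl_prepare_text` is a hypothetical helper, and the 5,000-word cap is an assumed, rough stand-in for the embedding model's input token limit; a real implementation would count tokens.

```php
<?php
/**
 * Turn a post's title, meta description, and HTML body into plain text
 * suitable for an embedding request.
 * $max_words is a rough, assumed proxy for the model's token limit.
 */
function sl_prepare_text(string $title, string $meta_description, string $html, int $max_words = 5000): string
{
    // Drop script/style blocks entirely, then replace remaining tags with spaces
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = preg_replace('/<[^>]+>/', ' ', $text);

    // Decode entities (&amp;, &nbsp;, ...) and normalise whitespace
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = str_replace("\u{00A0}", ' ', $text);
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Cap the body length by word count
    $words = explode(' ', $text);
    if (count($words) > $max_words) {
        $text = implode(' ', array_slice($words, 0, $max_words));
    }

    // Title and description come first, followed by the body; empty parts are skipped
    $parts = array_filter([trim($title), trim($meta_description), $text], 'strlen');
    return implode("\n\n", $parts);
}

echo sl_prepare_text(
    'Understanding DUI Penalties in California',
    'What to expect after a DUI arrest.',
    '<h2>First offense</h2><p>Fines &amp; license suspension.</p><script>track();</script>'
), PHP_EOL;
```

Replacing tags with spaces (rather than deleting them) keeps adjacent headings and paragraphs from running together into a single word.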
2. Vector Database Connector
class Vector_DB {
    private $client;

    public function upsert($post_id, $vector, $metadata) {}
    public function search($vector, $top_k = 10, $filter = []) {}
    public function delete($post_id) {}
    public function get_stats() {}
}
3. Embedding Generator
class Embedding_Generator {
    private $api_key;
    private $model = 'text-embedding-3-large';

    public function generate($text) {
        // Call OpenAI API
        // Handle rate limiting
        // Return vector array
    }

    public function batch_generate($texts) {
        // Process multiple texts efficiently
    }
}
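One common way to handle rate limiting and transient failures inside `generate()` is retry with exponential backoff. The sketch below is a generic wrapper, not the OpenAI client itself: `sl_with_retry` is a hypothetical name, the request is a stand-in closure, and the attempt count and delays are assumed defaults.

```php
<?php
/**
 * Call $request until it succeeds or $max_attempts is reached, doubling the
 * wait between attempts. $sleep is injectable so the delay can be stubbed in tests.
 */
function sl_with_retry(callable $request, int $max_attempts = 4, int $base_delay_ms = 500, ?callable $sleep = null)
{
    $sleep = $sleep ?? function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $request($attempt);
        } catch (RuntimeException $e) {
            if ($attempt >= $max_attempts) {
                throw $e; // Give up and surface the last error
            }
            // With the default base delay: 500 ms, 1 s, 2 s, ...
            $sleep($base_delay_ms * (2 ** ($attempt - 1)));
        }
    }
}

// Usage: a stand-in request that fails twice (for example with HTTP 429) and then succeeds
$result = sl_with_retry(function (int $attempt): array {
    if ($attempt < 3) {
        throw new RuntimeException('Rate limited');
    }
    return [0.023, -0.145, 0.678];
}, 4, 1);

echo count($result), ' values returned', PHP_EOL;
```

Only `RuntimeException` is retried here; errors that will never succeed on retry (such as an invalid API key) should be thrown as a different type so they fail fast.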
4. Recommendation Engine
class Recommender {
    public function get_recommendations($post_id, $options = []) {
        // 1. Get post vector from DB
        // 2. Query vector DB for similar
        // 3. Apply filters (practice area, location, etc.)
        // 4. Apply business rules
        // 5. Return ranked recommendations
    }

    private function apply_filters($results, $filters) {}
    private function apply_business_rules($results) {}
    private function rank_results($results) {}
}
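A minimal sketch of the filtering step (step 3 above), written as a standalone function so it can be unit tested. The match shape, the function name `sl_filter_matches`, and the 0.75 default threshold are assumptions for illustration; the real shape depends on the vector database client.

```php
<?php
/**
 * Apply metadata filters to raw vector-search matches.
 * Each match is assumed to look like
 * ['post_id' => int, 'score' => float, 'metadata' => [...]].
 */
function sl_filter_matches(array $matches, int $source_post_id, array $filters, float $min_score = 0.75): array
{
    $kept = [];
    foreach ($matches as $match) {
        // Never recommend a post to itself, and skip repeats of the same post
        if ($match['post_id'] === $source_post_id || isset($kept[$match['post_id']])) {
            continue;
        }
        if ($match['score'] < $min_score) {
            continue;
        }
        // Every requested filter (practice_area, location, ...) must match exactly
        foreach ($filters as $field => $expected) {
            if (($match['metadata'][$field] ?? null) !== $expected) {
                continue 2;
            }
        }
        $kept[$match['post_id']] = $match;
    }
    return array_values($kept);
}

$matches = [
    ['post_id' => 123, 'score' => 1.00, 'metadata' => ['practice_area' => 'criminal-defense', 'location' => 'california']],
    ['post_id' => 7,   'score' => 0.90, 'metadata' => ['practice_area' => 'criminal-defense', 'location' => 'california']],
    ['post_id' => 8,   'score' => 0.80, 'metadata' => ['practice_area' => 'family-law',       'location' => 'california']],
];
$kept = sl_filter_matches($matches, 123, ['practice_area' => 'criminal-defense']);
echo count($kept), ' recommendation(s) kept', PHP_EOL;
```

Most vector databases can also apply metadata filters server-side at query time; filtering again in PHP is a cheap safeguard when business rules change faster than stored metadata.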
4️⃣ Database Schema & Data Structure
WordPress MySQL Tables
-- Table names assume the default wp_ prefix; in plugin code, build them from $wpdb->prefix.

-- Metadata tracking table
CREATE TABLE wp_semantic_linking_meta (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    post_id BIGINT UNSIGNED NOT NULL,
    vector_id VARCHAR(255) NOT NULL,
    embedding_model VARCHAR(100),
    embedding_dimensions INT,
    embedding_cost DECIMAL(10,6),
    last_embedded DATETIME,
    practice_area VARCHAR(100),
    location VARCHAR(100),
    content_stage VARCHAR(50),
    INDEX idx_post_id (post_id),
    INDEX idx_vector_id (vector_id),
    INDEX idx_practice_area (practice_area),
    -- The foreign key requires InnoDB on both tables
    FOREIGN KEY (post_id) REFERENCES wp_posts(ID) ON DELETE CASCADE
);

-- Recommendation cache table
CREATE TABLE wp_semantic_linking_cache (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    post_id BIGINT UNSIGNED NOT NULL,
    recommendations TEXT, -- JSON array
    cache_date DATETIME,
    INDEX idx_post_id (post_id),
    INDEX idx_cache_date (cache_date)
);

-- Analytics tracking
CREATE TABLE wp_semantic_linking_analytics (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    source_post_id BIGINT UNSIGNED,
    recommended_post_id BIGINT UNSIGNED,
    similarity_score DECIMAL(5,4),
    shown_date DATETIME,
    accepted BOOLEAN DEFAULT FALSE,
    clicked BOOLEAN DEFAULT FALSE,
    INDEX idx_source (source_post_id),
    INDEX idx_recommended (recommended_post_id)
);
Vector Database Structure (Pinecone Example)
{
    "id": "post_123",
    "values": [0.023, -0.145, 0.678, ...], // 3072 dimensions
    "metadata": {
        "post_id": 123,
        "title": "Understanding DUI Penalties in California",
        "url": "https://example.com/dui-penalties-california",
        "post_type": "post",
        "practice_area": "criminal-defense",
        "location": "california",
        "content_stage": "awareness",
        "word_count": 1500,
        "published_date": "2025-01-15",
        "author_id": 5,
        "traffic_score": 0.75, // normalized 0-1
        "conversion_rate": 0.042
    }
}
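A record in the shape shown above could be assembled in PHP before it is sent to the vector database. This is a sketch only: `sl_build_vector_record` is a hypothetical helper, and the dimension check assumes the index was created for the chosen embedding model.

```php
<?php
/**
 * Build one vector record with an id, values, and metadata.
 * $expected_dimensions must match the index (3072 for text-embedding-3-large).
 */
function sl_build_vector_record(int $post_id, array $embedding, array $metadata, int $expected_dimensions = 3072): array
{
    if (count($embedding) !== $expected_dimensions) {
        throw new LengthException(
            sprintf('Expected %d dimensions, got %d', $expected_dimensions, count($embedding))
        );
    }
    return [
        'id'       => 'post_' . $post_id,
        'values'   => array_map('floatval', array_values($embedding)),
        // post_id is always taken from the argument, even if $metadata contains one
        'metadata' => ['post_id' => $post_id] + $metadata,
    ];
}

// Three dimensions are used here only to keep the example short
$record = sl_build_vector_record(
    123,
    [0.023, -0.145, 0.678],
    ['title' => 'Understanding DUI Penalties in California', 'practice_area' => 'criminal-defense'],
    3
);
echo json_encode($record, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Failing early on a dimension mismatch catches the most common configuration error, which is switching embedding models without recreating the index.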
PostgreSQL + pgvector Schema
-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Main content embeddings table
CREATE TABLE content_embeddings (
    id SERIAL PRIMARY KEY,
    post_id BIGINT NOT NULL,
    title TEXT,
    url TEXT,
    embedding vector(3072), -- or 1536 for smaller model
    post_type VARCHAR(50),
    practice_area VARCHAR(100),
    location VARCHAR(100),
    content_stage VARCHAR(50),
    word_count INTEGER,
    published_date TIMESTAMP,
    author_id INTEGER,
    traffic_score DECIMAL(3,2),
    conversion_rate DECIMAL(5,4),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index for vector similarity search (build it after the initial bulk load
-- so IVFFlat has data to cluster). pgvector can index the vector type only up to
-- 2,000 dimensions, so a 3072-dimension column is indexed through a halfvec cast
-- (pgvector 0.7.0+). A vector(1536) column can use vector_cosine_ops directly.
-- Queries must order by the same cast expression to use this index.
CREATE INDEX ON content_embeddings
    USING ivfflat ((embedding::halfvec(3072)) halfvec_cosine_ops)
    WITH (lists = 100);

-- Create standard indexes
CREATE INDEX idx_post_id ON content_embeddings(post_id);
CREATE INDEX idx_practice_area ON content_embeddings(practice_area);
CREATE INDEX idx_location ON content_embeddings(location);
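To query this table from PHP, the embedding has to be passed as a pgvector text literal, and cosine similarity is derived from the `<=>` cosine-distance operator. The helper name `sl_vector_literal` and the constant `SL_SIMILAR_POSTS_SQL` are hypothetical, and the sketch stops short of opening a database connection.

```php
<?php
/**
 * Format a PHP array as a pgvector text literal, e.g. [0.5,-0.25,0.125].
 */
function sl_vector_literal(array $embedding): string
{
    $parts = [];
    foreach ($embedding as $value) {
        if ((!is_int($value) && !is_float($value)) || !is_finite((float) $value)) {
            throw new InvalidArgumentException('Embedding values must be finite numbers');
        }
        $parts[] = json_encode((float) $value);
    }
    return '[' . implode(',', $parts) . ']';
}

// <=> returns cosine distance, so similarity = 1 - distance.
// Placeholders use the $1, $2 style accepted by pg_query_params().
const SL_SIMILAR_POSTS_SQL = <<<'SQL'
SELECT post_id, title, url, 1 - (embedding <=> $1::vector) AS similarity
FROM content_embeddings
WHERE post_id <> $2
ORDER BY embedding <=> $1::vector
LIMIT $3
SQL;

// Parameters: query embedding, the source post to exclude, and the result limit
$params = [sl_vector_literal([0.5, -0.25, 0.125]), 123, 10];
echo $params[0], PHP_EOL;
// With a live connection: $rows = pg_query_params($connection, SL_SIMILAR_POSTS_SQL, $params);
```

Passing the vector as a bound parameter rather than concatenating it into the SQL string keeps the query safe even though the values come from an external API.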
5️⃣ API Integration & Security
Required API Accounts & Keys
OpenAI API Setup
- Create account at platform.openai.com
- Add payment method (required for API access)
- Generate API key with embedding permissions
- Set spending limits ($10-50/month typical for medium law firm)
- Store API key in WordPress wp-config.php or environment variables
// In wp-config.php
define('OPENAI_API_KEY', 'sk-...');
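A small sketch of how the plugin might resolve the key at runtime and keep it out of logs. Both function names are hypothetical; the lookup order (constant first, then environment variable) is an assumption that matches the two storage options listed above.

```php
<?php
/**
 * Resolve the OpenAI key from a wp-config.php constant first, then the environment.
 * Returns null when neither is set so callers can show a configuration notice.
 */
function sl_get_openai_key(): ?string
{
    if (defined('OPENAI_API_KEY') && constant('OPENAI_API_KEY') !== '') {
        return (string) constant('OPENAI_API_KEY');
    }
    $from_env = getenv('OPENAI_API_KEY');
    return ($from_env === false || $from_env === '') ? null : $from_env;
}

/** Mask a secret for log output, keeping at most the last four characters. */
function sl_mask_secret(string $secret): string
{
    $visible = strlen($secret) > 8 ? substr($secret, -4) : '';
    return str_repeat('*', 8) . $visible;
}

// Usage with a placeholder value (not a real key)
putenv('OPENAI_API_KEY=sk-example-not-a-real-key');
$key = sl_get_openai_key();
echo $key === null ? 'No key configured' : sl_mask_secret($key), PHP_EOL;
```

Masking at the logging boundary means a debug log or error report never contains the full key, which supports the "don't expose API errors" requirement in the security list.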
Vector Database API Setup (Pinecone Example)
- Sign up at pinecone.io
- Create index with 3072 dimensions, cosine metric
- Generate API key
- Note your environment region
- Store credentials securely
// In wp-config.php
define('PINECONE_API_KEY', 'xxx');
define('PINECONE_ENVIRONMENT', 'us-west1-gcp');
define('PINECONE_INDEX_NAME', 'semantic-linking');
Security Best Practices
🔒 Critical Security Requirements
- Never store API keys in database – use wp-config.php or environment variables
- Implement rate limiting – prevent API abuse and control costs
- Validate all user input – sanitize content before embedding
- Use nonces for AJAX requests – prevent CSRF attacks
- Check user capabilities – ensure only authorized users access admin features
- Encrypt sensitive data in transit – use HTTPS for all API calls
- Implement error handling – don’t expose API errors to front-end users
- Log security events – track API usage and failed requests
Rate Limiting Implementation
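class Rate_Limiter {
    private $transient_prefix = 'sl_rate_limit_';
    private $max_requests = 60; // per window
    private $window = 60;       // window length in seconds

    public function check_limit($user_id) {
        $key = $this->transient_prefix . $user_id;
        $state = get_transient($key);
        $now = time();

        // Start a new window if none exists or the previous one has expired
        if (!is_array($state) || $now - $state['start'] >= $this->window) {
            set_transient($key, ['count' => 1, 'start' => $now], $this->window);
            return true;
        }
        if ($state['count'] >= $this->max_requests) {
            return false;
        }

        // Keep the original expiry so each request does not extend the window
        $state['count']++;
        set_transient($key, $state, $this->window - ($now - $state['start']));
        return true;
    }
}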
6️⃣ WordPress Admin Interface
Required Admin Pages
1. Settings Page
- API credentials configuration
- Vector database selection & connection
- Embedding model selection
- Similarity threshold slider (0.0 – 1.0)
- Practice area taxonomy mapping
- Location filtering rules
- Business rules configuration
- Auto-linking enable/disable toggle
2. Dashboard Page
- Total posts embedded count
- Embedding API usage & cost
- Vector DB storage metrics
- Recent embedding activity log
- Recommendation acceptance rates
- Top performing content clusters
- System health status
3. Batch Processing Page
- Bulk embed all posts button
- Filter by post type, category, date range
- Progress bar with estimated completion
- Pause/resume functionality
- Error handling & retry logic
- Export/import embeddings (backup)
4. Post Editor Meta Box
- Show recommendations for current post
- Display similarity scores
- One-click insert link buttons
- Manual regenerate embedding button
- Practice area/location override
- Exclude from recommendations toggle
5. Analytics Page
- Most recommended posts
- Recommendation click-through rates
- Content cluster visualization
- Practice area distribution charts
- User engagement impact metrics
- ROI calculator based on traffic improvements
7️⃣ Step-by-Step Implementation Workflow
Phase 1: Foundation (Weeks 1-2)
- Environment Setup
- Set up local WordPress development environment
- Install required PHP extensions (curl, json)
- Initialize Git repository
- Set up Composer for dependency management
- API Account Creation
- Create OpenAI account & generate API key
- Set up Pinecone account (or PostgreSQL server)
- Configure API credentials in wp-config.php
- Test API connectivity with simple scripts
- Database Design
- Design MySQL tables for metadata
- Create vector database index/collection
- Document data schema
- Create database migration scripts
Phase 2: Core Development (Weeks 3-5)
- Plugin Scaffold
- Create plugin directory structure
- Write main plugin file with activation/deactivation hooks
- Set up autoloading for classes
- Implement settings API
- Content Extraction Module
- Hook into WordPress save_post action
- Extract title, meta description, content
- Clean and prepare text for embedding
- Handle custom post types
- Embedding Generator
- Create OpenAI API client class
- Implement retry logic for failed requests
- Add rate limiting
- Cache embeddings to avoid regeneration
- Vector Database Integration
- Build abstraction layer for vector DB operations
- Implement upsert, search, delete functions
- Handle connection errors gracefully
- Add logging for debugging
Phase 3: Recommendation Engine (Weeks 6-7)
- Similarity Search
- Query vector DB for similar content
- Implement cosine similarity threshold filtering
- Handle edge cases (no results, duplicate posts)
- Business Rules Engine
- Practice area filtering
- Geographic consistency checking
- Content stage matching
- Recency weighting algorithm
- Diversity requirements
- Ranking Algorithm
- Composite scoring combining similarity + business factors
- Normalize scores to 0-100 scale
- Sort and limit results
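The ranking step could be sketched as a weighted blend of similarity and business factors, scaled to 0-100. The weights (0.7 / 0.2 / 0.1), the one-year recency half-life, and the function name `sl_rank_candidates` are illustrative assumptions to be tuned against real acceptance data, not values from this specification.

```php
<?php
/**
 * Rank candidates by a weighted blend of similarity, recency, and traffic.
 * Each candidate is assumed to carry 'similarity' (0-1), 'published_date'
 * (Y-m-d), and optionally 'traffic_score' (0-1).
 */
function sl_rank_candidates(array $candidates, int $limit = 5, ?int $now = null): array
{
    $now = $now ?? time();
    $weights = ['similarity' => 0.7, 'recency' => 0.2, 'traffic' => 0.1];

    foreach ($candidates as &$c) {
        $age_days = max(0, ($now - strtotime($c['published_date'])) / 86400);
        $recency = 0.5 ** ($age_days / 365); // 1.0 today, 0.5 after one year
        $blend = $weights['similarity'] * max(0.0, min(1.0, $c['similarity']))
               + $weights['recency'] * $recency
               + $weights['traffic'] * ($c['traffic_score'] ?? 0.0);
        $c['score'] = (int) round($blend * 100); // normalise to 0-100
    }
    unset($c);

    // Highest score first, then keep only the top $limit
    usort($candidates, fn(array $x, array $y): int => $y['score'] <=> $x['score']);
    return array_slice($candidates, 0, $limit);
}

$ranked = sl_rank_candidates([
    ['post_id' => 7, 'similarity' => 0.80, 'published_date' => '2024-01-16', 'traffic_score' => 0.0],
    ['post_id' => 9, 'similarity' => 0.90, 'published_date' => '2025-01-15', 'traffic_score' => 0.5],
], 5, strtotime('2025-01-15'));

foreach ($ranked as $row) {
    echo $row['post_id'], ': ', $row['score'], PHP_EOL;
}
```

Because the weights sum to 1.0 and every factor is clamped to the 0-1 range, the blended score always lands between 0 and 100.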
Phase 4: Admin UI (Weeks 8-9)
- Settings Page
- Build WordPress settings page with sections
- Add form fields for all configuration options
- Implement validation and sanitization
- Test API connections from settings
- Dashboard
- Display statistics widgets
- Create charts for analytics (Chart.js)
- Show recent activity log
- Add system health indicators
- Post Editor Integration
- Create meta box for recommendations
- Build AJAX handler for real-time suggestions
- Add one-click link insertion
- Style interface to match WordPress admin
- Batch Processing UI
- Build bulk embedding interface
- Create progress tracking system
- Implement pause/resume using WP cron
- Add export/import functionality
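Progress tracking and pause/resume both become simpler if each scheduled run processes one chunk and returns a small state array. The sketch below is framework-independent: `sl_run_batch_step`, the state keys, and the chunk size of 20 are assumptions, and in WordPress the state would typically be persisted between cron runs (for example in an option).

```php
<?php
/**
 * Process one chunk of a bulk-embedding job and return the updated state.
 * State is a plain array so it can be stored between runs and resumed later.
 */
function sl_run_batch_step(array $post_ids, array $state, callable $embed, int $chunk_size = 20): array
{
    // Fill in defaults for a fresh job without overwriting existing progress
    $state += ['offset' => 0, 'failed' => [], 'paused' => false];
    if ($state['paused']) {
        return $state; // Nothing to do until the job is resumed
    }

    foreach (array_slice($post_ids, $state['offset'], $chunk_size) as $post_id) {
        try {
            $embed($post_id);
        } catch (Exception $e) {
            $state['failed'][$post_id] = $e->getMessage(); // Keep for a later retry
        }
        $state['offset']++;
    }

    $total = count($post_ids);
    $state['done'] = $state['offset'] >= $total;
    $state['percent'] = $total > 0 ? intdiv($state['offset'] * 100, $total) : 100;
    return $state;
}

// Usage: five posts in chunks of two, with a stand-in embed step that fails for post 4
$state = [];
do {
    $state = sl_run_batch_step([1, 2, 3, 4, 5], $state, function (int $post_id): void {
        if ($post_id === 4) {
            throw new RuntimeException('Embedding failed');
        }
    }, 2);
    echo $state['percent'], '% complete', PHP_EOL;
} while (!$state['done']);
```

Recording failures in the state instead of aborting lets the progress bar finish and gives the retry logic an explicit list to work from.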
Phase 5: Testing & Optimization (Weeks 10-12)
- Unit Testing
- Write PHPUnit tests for core functions
- Test API error handling
- Verify recommendation algorithm accuracy
- Performance Testing
- Benchmark embedding generation speed
- Test vector search query performance
- Optimize database queries
- Implement caching where needed
- Integration Testing
- Test with real law firm content
- Verify recommendations make semantic sense
- Test edge cases (very short content, duplicate titles)
- Ensure compatibility with common WordPress plugins
- Security Audit
- Test input sanitization
- Verify nonce validation
- Check capability checks
- Scan for SQL injection vulnerabilities
- Documentation
- Write user documentation
- Create developer documentation
- Document API endpoints
- Prepare troubleshooting guide
8️⃣ Complete Cost Breakdown
One-Time Development Costs
| Item | Details | Cost Range |
|---|---|---|
| Senior Developer | 8-12 weeks @ $75-150/hr | $24,000 – $72,000 |
| Database Engineer | 2-3 weeks part-time @ $100-200/hr | $4,000 – $12,000 |
| Frontend Developer | 2-3 weeks part-time @ $60-120/hr | $2,400 – $7,200 |
| QA/Testing | 2 weeks @ $50-100/hr | $2,000 – $8,000 |
| Project Management | 10% of dev costs | $3,240 – $9,920 |
| Initial Embedding | 500-1000 pages @ OpenAI rates | $2 – $10 |
| TOTAL ONE-TIME | | $35,642 – $109,130 |
Monthly Recurring Costs
| Service | Usage | Monthly Cost |
|---|---|---|
| Pinecone (Free Tier) | Up to 100K vectors, 1 index | $0 |
| Pinecone (Starter) | 5M vectors, 1 pod, recommended for growth | $70 |
| PostgreSQL + pgvector | Self-hosted or managed (Supabase, AWS RDS) | $0 – $50 |
| OpenAI Embeddings | ~50-100 new posts/month | $0.50 – $2 |
| Hosting/Infrastructure | Additional server resources if needed | $0 – $100 |
| Monitoring/Logging | Optional services like LogRocket, Sentry | $0 – $50 |
| TOTAL MONTHLY | One vector database option plus the other services | $0.50 – $222 |
💡 Cost Optimization Tips
- Start with Pinecone free tier (sufficient for most law firms with <1,000 pages)
- Use text-embedding-3-small instead of 3-large to cut embedding costs by 85% (marginal accuracy tradeoff)
- Implement aggressive caching to avoid re-embedding unchanged content
- Consider open-source embedding models for long-term cost savings
- Batch embeddings during off-peak hours to optimize API rate limits
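One way to avoid re-embedding unchanged content is to store a fingerprint of the prepared text and model alongside each embedding and compare it before calling the API. The function names are hypothetical, and the fingerprint would need somewhere to live, such as an extra column or a post meta field that is not part of the schema in this document.

```php
<?php
/**
 * Fingerprint the exact text and model used for an embedding.
 */
function sl_embedding_fingerprint(string $prepared_text, string $model): string
{
    return hash('sha256', $model . "\n" . $prepared_text);
}

/**
 * True when the post has never been embedded, or when its text or the
 * embedding model has changed since the stored fingerprint was written.
 */
function sl_needs_embedding(string $prepared_text, string $model, ?string $stored_fingerprint): bool
{
    return $stored_fingerprint === null
        || !hash_equals($stored_fingerprint, sl_embedding_fingerprint($prepared_text, $model));
}

$text = 'Understanding DUI Penalties in California';
$stored = sl_embedding_fingerprint($text, 'text-embedding-3-small');

var_dump(sl_needs_embedding($text, 'text-embedding-3-small', $stored));             // unchanged content
var_dump(sl_needs_embedding($text . ' (updated)', 'text-embedding-3-small', $stored)); // edited content
```

Including the model name in the fingerprint means that switching models invalidates every stored embedding automatically, which is the desired behaviour because vectors from different models are not comparable.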
9️⃣ Comprehensive Testing Checklist
✅ Functional Testing
- New post auto-embeds on publish
- Updated post re-embeds correctly
- Deleted post removes vector from DB
- Recommendations appear in post editor
- Similarity scores calculate correctly
- Filters (practice area, location) work as expected
- One-click link insertion functions properly
- Batch processing completes successfully
- Cache invalidation works correctly
⚡ Performance Testing
- Single embedding generation < 2 seconds
- Vector search query < 500ms
- Recommendation display in post editor < 1 second
- Batch processing handles 1000+ posts without timeout
- Memory usage stays under PHP limits
- Database queries optimized (no N+1 issues)
- Caching reduces API calls by 80%+
🔒 Security Testing
- API keys not exposed in client-side code
- Nonces validated on all AJAX requests
- Capability checks enforce permissions
- Input sanitization prevents XSS
- SQL queries use prepared statements
- Rate limiting prevents API abuse
- Error messages don’t leak sensitive info
🎯 Accuracy Testing
- Semantically similar content scores high (>0.75)
- Unrelated content scores low (<0.50)
- Practice area filtering works correctly
- Geographic filtering prevents cross-state links
- Business rules applied consistently
- Manual review of 50 random recommendations confirms relevance
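The score thresholds above can be checked in unit tests with a local cosine-similarity helper, independent of the vector database. This is a reference implementation for tests only, under the assumption that the production index uses the cosine metric; `sl_cosine_similarity` is a hypothetical name.

```php
<?php
/**
 * Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
 * Returns 0.0 when either vector has zero magnitude.
 */
function sl_cosine_similarity(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same number of dimensions');
    }

    $dot = 0.0;
    $norm_a = 0.0;
    $norm_b = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $norm_a += $value * $value;
        $norm_b += $b[$i] * $b[$i];
    }
    if ($norm_a == 0.0 || $norm_b == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($norm_a) * sqrt($norm_b));
}

// Same direction scores near 1, orthogonal vectors score 0
printf('%.2f' . PHP_EOL, sl_cosine_similarity([1, 2, 3], [2, 4, 6]));
printf('%.2f' . PHP_EOL, sl_cosine_similarity([1, 0], [0, 1]));
```

Comparing a handful of locally computed scores against the values returned by the vector database is a quick way to confirm that the index was created with the intended metric.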
🔄 Compatibility Testing
- Works with Gutenberg editor
- Compatible with Classic Editor plugin
- No conflicts with Yoast SEO / Rank Math
- Works alongside caching plugins (WP Rocket, etc.)
- Compatible with common page builders (Elementor, etc.)
- Tested on PHP 8.0, 8.1, 8.2
- Works with WordPress 6.0, 6.1, 6.2+
🔟 Production Go-Live Checklist
☑️ Pre-Launch
- All tests passing (functional, performance, security)
- Production API keys configured
- Vector database production index created
- Backup strategy in place
- Rollback plan documented
- Monitoring and alerts configured
- Team training completed
☑️ Launch Day
- Deploy plugin to production
- Run initial batch embedding (off-peak hours)
- Verify all connections working
- Test sample recommendations on live site
- Enable monitoring dashboards
- Announce to content team
☑️ Post-Launch (First Week)
- Monitor error logs daily
- Track API usage and costs
- Review recommendation acceptance rates
- Gather user feedback from content team
- Adjust similarity thresholds if needed
- Document any issues and resolutions
☑️ Ongoing (Monthly)
- Review analytics dashboard
- Optimize business rules based on data
- Monitor vector DB storage usage
- Review and optimize API costs
- Update documentation as needed
- Plan feature enhancements
📊 Project Summary
| Item | Summary |
|---|---|
| Timeline | 6-8 weeks for MVP; 12 weeks for full production deployment |
| Development Cost | $35,000 – $110,000 (varies by team, complexity) |
| Monthly Recurring | $0.50 – $220 (depending on scale, free tier often sufficient) |
| Required Skills | PHP, WordPress, API integration, database design, vector DB experience |
| Key Dependencies | OpenAI API, Vector Database (Pinecone/pgvector), WordPress 6.0+ |
| Expected ROI Timeline | 4-6 months through improved engagement, traffic, conversions |
🎯 Success Factors
- Start with proven vector DB (Pinecone) for fastest implementation
- Implement comprehensive error handling and logging from day one
- Test with real law firm content during development
- Involve content team early for feedback on recommendations
- Monitor costs closely in first month, optimize as needed
- Document everything for long-term maintainability
- Plan for iterative improvements based on usage data