# External Data Integration - Auto-AI for Django Models

## Overview
Transform any Django model into an AI-searchable knowledge source with a single mixin. The `ExternalDataMixin` provides automatic vectorization, semantic search, and AI chat integration; the only setup it needs is an `ExternalDataMeta` class naming the fields to watch and a `get_external_content()` method.

**Philosophy**: "One line of code, infinite AI possibilities". Add the mixin to your model and get automatic AI integration that updates in real time as your data changes.

**TAGS**: `mixin, auto-integration, vectorization, real-time, django-models, ai-search`

## Modules

### @knowbase/mixins/ExternalDataMixin

**Purpose**: Automatic AI integration for Django models with real-time vectorization and semantic search capabilities.

**Dependencies**:
- `django_cfg.apps.knowbase.models.ExternalData`
- Django signals (`post_save`, `post_delete`)
- ReArq background tasks
- OpenAI embeddings API

**Exports**:
- `ExternalDataMixin` - Main mixin class
- `ExternalDataMeta` - Configuration class
- Auto-generated fields and methods

**Used in**:
- E-commerce product catalogs
- User profiles and content
- Documentation systems
- Any Django model requiring AI search

**Tags**: `mixin, signals, auto-sync, vectorization`

## Advanced Configuration
```python
from django.contrib.auth.models import User
from django.db import models

# Import paths assumed from the module names listed in this document;
# adjust them to match your installation.
from django_cfg.apps.knowbase.mixins import ExternalDataMixin
from django_cfg.apps.knowbase.models import ExternalDataType


class Article(ExternalDataMixin, models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    author = models.ForeignKey(User, on_delete=models.CASCADE)
    tags = models.ManyToManyField('Tag')
    published_at = models.DateTimeField(auto_now_add=True)

    class ExternalDataMeta:
        watch_fields = ['title', 'content', 'tags']
        similarity_threshold = 0.6
        auto_sync = True
        is_public = True  # Searchable by all users
        # Optional: Custom source type
        source_type = ExternalDataType.CUSTOM

    def get_external_content(self):
        # Include related data in content
        tag_names = ", ".join(self.tags.values_list('name', flat=True))
        return f"""# {self.title}

**Author**: {self.author.get_full_name()}
**Published**: {self.published_at.strftime('%Y-%m-%d')}
**Tags**: {tag_names}

## Content
{self.content}
"""

    def get_external_metadata(self):
        # Custom metadata for search filtering
        return {
            'author_id': self.author.id,
            'author_name': self.author.get_full_name(),
            'tag_count': self.tags.count(),
            'word_count': len(self.content.split()),
            'published_year': self.published_at.year,
        }
```

## Manual Control Methods
```python
class Document(ExternalDataMixin, models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    # Status flags set by publish() and archive() below
    is_published = models.BooleanField(default=False)
    is_archived = models.BooleanField(default=False)

    class ExternalDataMeta:
        watch_fields = ['title', 'content']
        auto_sync = False  # Manual control

    def get_external_content(self):
        return f"# {self.title}\n\n{self.content}"

    def publish(self):
        """Custom method that triggers AI sync"""
        self.is_published = True
        self.save()
        # Manually trigger AI sync
        self.sync_to_external_data()

    def archive(self):
        """Remove from AI search"""
        self.is_archived = True
        self.save()
        # Remove from AI system
        self.remove_from_external_data()
```
---
## Data Models (Pydantic 2 & TypeScript)
### Pydantic 2 Models (Backend)
```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, Field


class ExternalDataType(str, Enum):
    MODEL = "model"
    API = "api"
    CUSTOM = "custom"


class ExternalDataMetaConfig(BaseModel):
    """Configuration for ExternalDataMixin"""
    # Pydantic 2 uses min_length for collections (min_items is the deprecated v1 name)
    watch_fields: List[str] = Field(..., min_length=1)
    similarity_threshold: float = Field(0.5, ge=0.0, le=1.0)
    auto_sync: bool = True
    is_public: bool = False
    source_type: ExternalDataType = ExternalDataType.MODEL


class ExternalDataSyncRequest(BaseModel):
    """Request to sync model data to external data system"""
    model_name: str = Field(..., min_length=1)
    model_id: str = Field(..., min_length=1)
    force_update: bool = False


class ExternalDataSyncResponse(BaseModel):
    """Response from sync operation"""
    success: bool
    external_data_id: Optional[str] = None
    message: str
    processing_time: float
    chunks_created: int = 0
```
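
The 0.0 to 1.0 bound on `similarity_threshold` suggests a minimum-similarity cutoff applied to search hits. The sketch below shows one plausible reading, assuming cosine similarity between embedding vectors. The helper names `cosine_similarity` and `filter_by_threshold` and the toy 3-dimensional vectors are illustrative and not part of the library.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_threshold(
    query: List[float],
    candidates: Dict[str, List[float]],
    similarity_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (key, score) pairs scoring at or above the threshold, best first."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in candidates.items()]
    kept = [(key, score) for key, score in scored if score >= similarity_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values; real embeddings have many more dimensions)
candidates = {
    "article:1": [1.0, 0.0, 0.0],  # identical direction to the query
    "article:2": [0.8, 0.6, 0.0],  # similarity of roughly 0.8
    "article:3": [0.0, 0.0, 1.0],  # orthogonal to the query
}
hits = filter_by_threshold([1.0, 0.0, 0.0], candidates, similarity_threshold=0.6)
print([key for key, _ in hits])
```

With a threshold of 0.6 the orthogonal `article:3` is dropped and the other two are returned in score order. A higher threshold trades recall for precision, so the right value depends on how closely related your content is.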
### TypeScript Interfaces (Frontend)
```typescript
export enum ExternalDataType {
  MODEL = "model",
  API = "api",
  CUSTOM = "custom"
}

export interface ExternalDataMetaConfig {
  watch_fields: string[];
  similarity_threshold: number;
  auto_sync: boolean;
  is_public: boolean;
  source_type: ExternalDataType;
}

export interface ExternalDataSyncRequest {
  model_name: string;
  model_id: string;
  force_update: boolean;
}

export interface ExternalDataSyncResponse {
  success: boolean;
  external_data_id?: string;
  message: string;
  processing_time: number;
  chunks_created: number;
}

// Model integration status
export interface ModelIntegrationStatus {
  model_name: string;
  total_instances: number;
  synced_instances: number;
  pending_sync: number;
  last_sync: string;
  sync_enabled: boolean;
}
```
---
## 🔁 Flows
### Automatic Sync Flow (Real-time)
1. **Model Save** → Django model with mixin is saved/updated
2. **Signal Detection** → `post_save` signal detects changes in watched fields
3. **Change Validation** → Compare current values with previous state
4. **Content Generation** → Call model's `get_external_content()` method
5. **Background Task** → Queue `sync_external_data_async` task
6. **Content Processing** → Generate embeddings and create chunks
7. **Vector Storage** → Store in ExternalData with updated embeddings
8. **Search Index** → Update semantic search indexes
**Modules**:
- `ExternalDataMixin` - signal registration
- `external_data_signals.py` - change detection
- `external_data_tasks.py` - background processing
- `ExternalDataService` - content processing
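
Steps 2 and 3 of this flow (signal detection and change validation) can be sketched in plain Python. This is a minimal illustration that assumes field values are compared as before/after snapshots. The function names `changed_watch_fields` and `should_queue_sync` are hypothetical and do not come from `external_data_signals.py`.

```python
from typing import Any, Dict, Iterable, List


def changed_watch_fields(
    previous: Dict[str, Any],
    current: Dict[str, Any],
    watch_fields: Iterable[str],
) -> List[str]:
    """Return the watched field names whose values differ between two snapshots."""
    return [
        name for name in watch_fields
        if previous.get(name) != current.get(name)
    ]


def should_queue_sync(
    previous: Dict[str, Any],
    current: Dict[str, Any],
    watch_fields: Iterable[str],
    created: bool = False,
) -> bool:
    """New rows always sync; updates sync only if a watched field changed."""
    if created:
        return True
    return bool(changed_watch_fields(previous, current, watch_fields))


before = {"title": "Hello", "content": "v1", "views": 10}
after = {"title": "Hello", "content": "v2", "views": 11}

# "views" changed too, but it is not watched, so it is ignored
print(changed_watch_fields(before, after, ["title", "content"]))
print(should_queue_sync(before, after, ["title"]))
```

The first call reports only `content`, and the second returns `False` because `title` did not change. Keeping `watch_fields` short therefore reduces the number of background tasks queued.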
---
### Manual Sync Flow (On-demand)
1. **Manual Trigger** → Developer calls `instance.sync_to_external_data()`
2. **Content Generation** → Generate fresh content from model
3. **Immediate Processing** → Process synchronously or queue task
4. **Status Update** → Update `external_source_id` field
5. **Confirmation** → Return success/failure status
**Modules**:
- `ExternalDataMixin.sync_to_external_data()` method
- `ExternalDataService.create_or_update()` method
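
The manual flow above amounts to "generate content, upsert, record the ID, report status". The sketch below models that with an in-memory dictionary standing in for the vector store. `SyncResult` mirrors a subset of the `ExternalDataSyncResponse` fields, while `sync_once` and the `store` dictionary are assumptions made for illustration, not the library's implementation.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SyncResult:
    """Subset of the ExternalDataSyncResponse fields described in this document."""
    success: bool
    message: str
    processing_time: float
    external_data_id: Optional[str] = None


def sync_once(
    get_content: Callable[[], Optional[str]],
    store: Dict[str, str],
    external_source_id: Optional[str] = None,
) -> SyncResult:
    """Generate content, upsert it into `store`, and report the outcome."""
    started = time.perf_counter()
    try:
        content = get_content()
        if not content:
            # Nothing to index (for example, an unpublished draft)
            return SyncResult(False, "no content to sync", time.perf_counter() - started)
        # Reuse the existing ID on update, otherwise allocate a new one
        record_id = external_source_id or uuid.uuid4().hex
        store[record_id] = content
        return SyncResult(True, "synced", time.perf_counter() - started, record_id)
    except Exception as exc:
        # Report the failure to the caller instead of raising
        return SyncResult(False, f"sync failed: {exc}", time.perf_counter() - started)


store: Dict[str, str] = {}
first = sync_once(lambda: "# Title\n\nBody", store)
second = sync_once(lambda: "# Title\n\nEdited body", store, first.external_data_id)
print(first.success, second.external_data_id == first.external_data_id, len(store))
```

Passing the stored ID back in on the second call turns the operation into an update, so the store still holds a single record.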
---
### Bulk Sync Flow (Management Commands)
1. **Command Execution** → Run `python manage.py sync_external_models`
2. **Model Discovery** → Find all models using ExternalDataMixin
3. **Batch Processing** → Process models in configurable batches
4. **Progress Tracking** → Show sync progress and statistics
5. **Error Handling** → Log failures and continue processing
6. **Summary Report** → Display final sync statistics
**Modules**:
- Management command `sync_external_models`
- `ExternalDataService.bulk_sync()` method
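
Steps 3 to 6 (batching, progress statistics, and continuing past errors) follow a common pattern that can be shown without Django. The sketch below is an assumption about the general shape of such a loop. The `batched` and `bulk_sync` functions here are illustrative and are not the code behind `ExternalDataService.bulk_sync()`.

```python
import logging
from typing import Callable, Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_sync(
    items: Iterable[T],
    sync_one: Callable[[T], None],
    batch_size: int = 100,
) -> Dict[str, int]:
    """Sync every item, logging failures and continuing; return summary counts."""
    stats = {"batches": 0, "synced": 0, "failed": 0}
    for batch in batched(items, batch_size):
        stats["batches"] += 1
        for item in batch:
            try:
                sync_one(item)
                stats["synced"] += 1
            except Exception:
                # One bad record must not abort the whole run
                logger.exception("Sync failed for %r", item)
                stats["failed"] += 1
    return stats


def flaky_sync(item_id: int) -> None:
    """Stand-in sync function that fails for every fourth ID."""
    if item_id % 4 == 0:
        raise RuntimeError("simulated failure")


summary = bulk_sync(range(1, 11), flaky_sync, batch_size=3)
print(summary)
```

Ten items in batches of three give four batches; IDs 4 and 8 fail and are logged, and the remaining eight are counted as synced.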
---
## Advanced Patterns
### Conditional Sync
```python
class BlogPost(ExternalDataMixin, models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    status = models.CharField(max_length=20, choices=[
        ('draft', 'Draft'),
        ('published', 'Published'),
        ('archived', 'Archived'),
    ])

    class ExternalDataMeta:
        watch_fields = ['title', 'content', 'status']
        auto_sync = True

    def should_sync_to_external_data(self):
        """Override to control when sync happens"""
        return self.status == 'published'

    def get_external_content(self):
        if self.status != 'published':
            return None  # Don't sync non-published posts
        return f"# {self.title}\n\n{self.content}"
```
### Multi-language Content
```python
class MultiLanguageArticle(ExternalDataMixin, models.Model):
    title_en = models.CharField(max_length=200)
    title_es = models.CharField(max_length=200)
    content_en = models.TextField()
    content_es = models.TextField()

    class ExternalDataMeta:
        watch_fields = ['title_en', 'title_es', 'content_en', 'content_es']
        auto_sync = True

    def get_external_content(self):
        """Generate multi-language content"""
        return f"""# {self.title_en}

## English
{self.content_en}

## Español
# {self.title_es}
{self.content_es}
"""

    def get_external_metadata(self):
        return {
            'languages': ['en', 'es'],
            'primary_language': 'en',
        }
```
### Related Data Integration
```python
from datetime import timedelta

from django.utils import timezone


class Product(ExternalDataMixin, models.Model):
    name = models.CharField(max_length=100)
    description = models.TextField()
    category = models.ForeignKey('Category', on_delete=models.CASCADE)
    reviews = models.ManyToManyField('Review', blank=True)

    class ExternalDataMeta:
        watch_fields = ['name', 'description', 'category']
        auto_sync = True

    def get_external_content(self):
        # Include related data in content: top-rated reviews from the last 30 days
        recent_reviews = self.reviews.filter(
            created_at__gte=timezone.now() - timedelta(days=30)
        ).order_by('-rating')[:5]
        review_text = "\n".join([
            f"- {review.comment} (Rating: {review.rating}/5)"
            for review in recent_reviews
        ])
        return f"""# {self.name}

**Category**: {self.category.name}

## Description
{self.description}

## Recent Reviews
{review_text}
"""
```
---
## ⚠️ Anti-patterns to Avoid
### ❌ Heavy Content Generation
**Don't do this**:
```python
def get_external_content(self):
    # Expensive operations in content generation
    related_data = self.expensive_related_query()  # Slow!
    processed_content = self.complex_processing()  # CPU intensive!
    return f"Heavy content: {related_data} {processed_content}"
```
**Do this instead**:
```python
def get_external_content(self):
    # Keep content generation fast and simple
    return f"# {self.title}\n\n{self.description}"

# Use background tasks for heavy processing
def process_heavy_content(self):
    # Queue background task for expensive operations
    # (process_related_data_async stands in for your own task)
    process_related_data_async.send(self.id)
```
### ❌ Watching Too Many Fields
**Don't do this**:
```python
class ExternalDataMeta:
    # Watching every field causes unnecessary updates
    watch_fields = ['field1', 'field2', 'field3', 'field4', 'field5',
                    'field6', 'field7', 'field8', 'field9', 'field10']
```
**Do this instead**:
```python
class ExternalDataMeta:
    # Only watch fields that affect search relevance
    watch_fields = ['title', 'description']  # Core content only
```
### ❌ Ignoring Performance
**Don't do this**:
```python
def get_external_content(self):
    # N+1 queries in content generation
    reviews = []
    for review in self.reviews.all():  # N+1 problem!
        reviews.append(f"{review.user.name}: {review.comment}")
    return "\n".join(reviews)
```
**Do this instead**:
```python
def get_external_content(self):
    # One query: values_list() joins the user table itself, so
    # select_related() is not needed when only column values are read
    reviews = self.reviews.values_list('user__name', 'comment')
    review_text = "\n".join([f"{name}: {comment}" for name, comment in reviews])
    return f"# {self.title}\n\n{review_text}"
```
---
## Version Tracking
- `ADDED_IN: v1.1` - Initial ExternalDataMixin implementation
- `ADDED_IN: v1.2` - Conditional sync with `should_sync_to_external_data()`
- `ADDED_IN: v1.3` - Manual control methods (`sync_to_external_data()`, `remove_from_external_data()`)
- `ADDED_IN: v1.4` - Per-object similarity thresholds
- `CHANGED_IN: v1.5` - Improved signal handling and performance optimization
---
## Quick Integration Checklist
### Basic Setup
- [ ] Add `ExternalDataMixin` to your model
- [ ] Define `ExternalDataMeta` class with `watch_fields`
- [ ] Implement `get_external_content()` method
- [ ] Test with a simple model instance
### Advanced Configuration
- [ ] Set appropriate `similarity_threshold` for your content type
- [ ] Configure `is_public` based on your security requirements
- [ ] Implement custom `get_external_title()` and `get_external_description()`
- [ ] Add relevant metadata with `get_external_metadata()`
### Production Readiness
- [ ] Ensure background workers are running
- [ ] Monitor sync performance and adjust batch sizes
- [ ] Set up error monitoring for failed syncs
- [ ] Test with realistic data volumes
### Verification
- [ ] Model changes trigger automatic sync
- [ ] Content appears in semantic search results
- [ ] AI chat includes model data in responses
- [ ] Admin interface shows sync status
---
**DEPENDS_ON**: [django_cfg.apps.knowbase, Django signals, ReArq, OpenAI API]
**USED_BY**: [Product catalogs, User profiles, Content management, Documentation systems]
**TAGS**: `mixin, auto-integration, real-time-sync, django-models, ai-search`