Atlas Crawler Project Profile

Platform Capabilities

01
🤖
LLM-Powered Source Discovery
Claude Sonnet generates 18–28 curated seed URLs from a plain-English project brief. Supports 13 source categories: government, regulatory, statistical, news, industry, association, commercial, blog, marketplace, competitor, academic, forum, and general web.
🕷️
Deep Async BFS Crawler
Breadth-first scraping with configurable page depth (default 2) and page limits (default 50). Path-prefix locking keeps crawls on-topic. 15-minute runtime cap with persistent frontier for resume-on-next-run.
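A minimal sketch of the depth- and page-limited breadth-first crawl described above, not the code in backend/server.py. The names `bfs_crawl`, `in_scope`, and the injected `get_links` callable are hypothetical; in the real crawler the links would come from fetched HTML, and the leftover frontier would be persisted for the next run.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, seed, allowed_prefixes):
    """True when url is on the seed's host and under an allowed path prefix."""
    u, s = urlparse(url), urlparse(seed)
    if u.netloc != s.netloc:
        return False
    return any(u.path.startswith(prefix) for prefix in allowed_prefixes)


def bfs_crawl(seed, get_links, allowed_prefixes=("/",), max_depth=2, max_pages=50):
    """Visit pages breadth-first.

    Returns (visited_urls, leftover_frontier). The leftover frontier holds
    (url, depth) pairs that were queued but not visited before a limit hit,
    which is what a resume-on-next-run feature would need to store.
    """
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen and in_scope(link, seed, allowed_prefixes):
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited, list(frontier)
```

Passing `get_links` as a plain function keeps the traversal logic testable without network access. The 15-minute runtime cap is omitted here and would be one more condition on the `while` loop.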
🔄
Content-Hash Refresh Detection
SHA-256 hashing detects when page content changes. Stale chunks are automatically deleted and re-indexed. Unchanged pages are skipped. Tracks new, refreshed, and unchanged counts per job.
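The refresh check can be sketched in a few lines. This is an illustration under assumptions: `classify_page` and the in-memory `stored_hashes` dict are hypothetical stand-ins for whatever the project keeps in MongoDB.

```python
import hashlib


def content_hash(text):
    """SHA-256 hex digest of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_page(url, text, stored_hashes):
    """Return 'new', 'refreshed', or 'unchanged' and update stored_hashes.

    stored_hashes maps url -> last seen digest (hypothetical storage).
    """
    digest = content_hash(text)
    previous = stored_hashes.get(url)
    if previous == digest:
        return "unchanged"  # skip re-chunking and re-indexing
    stored_hashes[url] = digest
    return "new" if previous is None else "refreshed"
```

A caller would delete and re-index a page's chunks only on `refreshed`, and increment the matching per-job counter for each outcome.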
🔍
TF-IDF Retrieval + RAG Chat
Per-project TF-IDF index with bigram support and cosine similarity scoring. Bonus scoring for query-term density. Top-K chunks injected into LLM context for grounded, citation-backed answers.
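To show the scoring idea, here is a dependency-free sketch of TF-IDF with bigrams and cosine similarity. It is not the project's scikit-learn pipeline: the function names are hypothetical, stop-word removal and the query-term density bonus are left out, and the IDF uses the common smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {term: w / norm for term, w in vec.items()}


def build_index(chunks):
    """Return (idf, vectors): one unit-length TF-IDF vector per chunk."""
    counts = [Counter(tokens(chunk)) for chunk in chunks]
    df = Counter(term for count in counts for term in count)
    n = len(chunks)
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = [normalise({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def top_k(query, chunks, idf, vectors, k=3):
    """Rank chunks by cosine similarity to the query; drop zero scores."""
    q_counts = Counter(tokens(query))
    q = normalise({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    # Both sides are unit vectors, so the dot product is the cosine.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], scores[i]) for i in ranked[:k] if scores[i] > 0]
```

The chunks returned by `top_k` are what would be placed in the LLM context, with their source URLs attached so answers can cite them.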
📦
Per-Project Isolation
Every project's sources, documents, chunks, scrape jobs, and chat sessions are scoped by project_id, so queries never cross project boundaries. Deleting a project cascades to all of its records. Projects auto-expire after 90 days, enforced by an hourly background reaper.
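A rough sketch of what one reaper pass could look like, using plain lists in place of MongoDB collections. The `last_renewed_at` field and the `reap_expired` name are assumptions for illustration; the profile only states the 90-day window, the hourly cadence, and the cascade.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention window stated in the profile


def reap_expired(db, now=None):
    """Delete expired projects and every record scoped to them.

    db maps a collection name to a list of dict records (in-memory stand-in).
    Returns the set of project ids that were removed.
    """
    now = now or datetime.now(timezone.utc)
    expired = {
        p["id"] for p in db["projects"]
        if now - p["last_renewed_at"] >= RETENTION  # hypothetical field
    }
    for name, rows in list(db.items()):
        key = "id" if name == "projects" else "project_id"
        db[name] = [row for row in rows if row.get(key) not in expired]
    return expired
```

In the running service this pass would be scheduled once an hour, and each list filter would become a delete query filtered on project_id.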
📤
JSON Knowledgebase Export
Full portable export of every project's documents and chunks as downloadable JSON. Import/export API with job tracking. Downloads use a browser-anchor pattern, which avoids popup blockers.
🔧
Admin Maintenance Tooling
Deduplication endpoint removes duplicate documents by source+URL key. Purge-off-scope cleans documents that fall outside current path-prefix rules. Manual index rebuild endpoint for ops teams.
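The two cleanup operations reduce to simple filters. The sketch below assumes documents are dicts with `source_id` and `url` fields and that prefixes are stored per source; those field names and both function names are hypothetical.

```python
from urllib.parse import urlparse


def dedupe_documents(documents):
    """Keep the first document per (source_id, url); return (kept, removed)."""
    seen = set()
    kept, removed = [], []
    for doc in documents:
        key = (doc["source_id"], doc["url"])
        if key in seen:
            removed.append(doc)
        else:
            seen.add(key)
            kept.append(doc)
    return kept, removed


def purge_off_scope(documents, prefixes_by_source):
    """Keep only documents whose URL path matches their source's prefixes.

    Sources with no configured prefixes are treated as unrestricted ("/").
    """
    def allowed(doc):
        prefixes = prefixes_by_source.get(doc["source_id"]) or ["/"]
        path = urlparse(doc["url"]).path
        return any(path.startswith(prefix) for prefix in prefixes)

    return [doc for doc in documents if allowed(doc)]
```

After either operation the TF-IDF index no longer matches the stored chunks, which is presumably why a manual rebuild endpoint sits alongside them.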
🌐
Multi-LLM Model Selection
Per-chat model switching between Claude Sonnet 4.5, GPT-5.2, and Gemini 3 Pro. Powered by emergentintegrations LlmChat abstraction. Easily extensible to new providers.
🖥️
React 19 Frontend
Full multi-page SPA: Projects dashboard, new-project wizard, per-project admin panel, and chat interface. Built with shadcn/ui, Tailwind CSS, lucide-react icons, and Sonner toast notifications.

Technical Architecture

02
Backend · Python
FastAPI
Async REST API with Pydantic v2 models. APIRouter with /api prefix, CORS middleware, background tasks via asyncio.
Motor
Async MongoDB driver. Per-collection queries scoped by project_id. Cascade deletes, count queries, cursor-based pagination.
httpx
Async HTTP client for the crawler. 20s timeout, redirect-following, and a content-type gate that lets only HTML responses through.
BeautifulSoup4
lxml parser. Strips nav/footer/script/aside noise. Extracts main/article content. Link extraction with tracking-param stripping.
scikit-learn
TfidfVectorizer with bigrams, stop-word removal, 50k feature cap. Cosine similarity for per-project chunk retrieval.
Pydantic v2
Strict request/response validation. ProjectBrief, Source, Document, Chunk, ScrapeJob, ChatRequest models.
pytest
Integration test suite for stats, sources, scrape lifecycle, index rebuild, search, and LLM chat endpoints.
Frontend · React / JS
React 19
Latest React with hooks. react-router-dom v7 for multi-page SPA. useState/useEffect for async data management.
shadcn/ui
Full Radix UI component set: dialogs, dropdowns, badges, tabs, progress, toasts, accordion, command palette, and more.
Tailwind CSS
Utility-first styling with custom design tokens. craco config for path aliasing. PostCSS + autoprefixer pipeline.
axios
Typed API client (api.js) wrapping all backend endpoints. Project-scoped methods, error handling, toast integration.
lucide-react
Icon library. Loader2 spinners, Sparkles for AI actions, ExternalLink, ChevronLeft for navigation context.
sonner
Toast notification system. Long-duration toasts for async discovery (60s), success/error feedback on all mutations.
craco
Custom CRA config for webpack overrides, path aliases (@/components), and the custom health-check plugin.
Data · Infrastructure
MongoDB
8 collections: projects, sources, documents, chunks, scrape_jobs, chat_sessions, messages, import_jobs. Motor async driver.
Claude Sonnet
Source discovery (structured JSON output) + RAG chat. emergentintegrations.LlmChat abstraction with session management.
OpenAI GPT
GPT-5.2 as switchable model option. Same LlmChat interface, per-session conversation history management.
Gemini Pro
Gemini 3 Pro as third model option. google-genai + google-generativeai SDK support in requirements.
uvicorn
ASGI server for FastAPI. Background task execution for crawl jobs and hourly reaper via asyncio.create_task.
Developer Experience
URL normalization
Strips UTM/tracking params, normalizes port, trailing slash, and scheme. Prevents duplicate indexing of canonically identical URLs.
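A standard-library sketch of this kind of normalization follows. The exact rules Atlas applies are not given in the profile, so the tracking-key list, the choice to only lowercase the scheme, and the `normalize_url` name are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}  # assumed examples of tracking params
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of url for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is absent or the scheme's default.
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Treat "/docs/" and "/docs" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    kept_params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, urlencode(kept_params), ""))
```

For example, `normalize_url("HTTPS://Example.com:443/docs/?utm_source=x&page=2#top")` returns `https://example.com/docs?page=2`.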
Path-prefix scoping
Per-source allowed_path_prefixes list. LLM discovery auto-suggests path scopes. User-editable via Admin panel.
Chunking engine
700-word target chunks with 100-word overlap. Paragraph-aware splitting. Long paragraphs split with sliding window.
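One plausible reading of this chunking scheme, sketched with hypothetical names. Paragraphs are packed until the target would be exceeded, the last `overlap` words carry into the next chunk, and oversized paragraphs fall back to a sliding window. Because of the carried overlap, a packed chunk can run slightly over the target.

```python
def sliding_windows(words, size, overlap):
    """Split a long word list into windows of `size` sharing `overlap` words."""
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return windows


def chunk_text(text, target=700, overlap=100):
    """Paragraph-aware chunking; paragraphs are separated by blank lines."""
    chunks, current = [], []
    for para in (p.split() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) > target:
            # Flush what we have, then window the oversized paragraph.
            if current:
                chunks.append(current)
                current = []
            chunks.extend(sliding_windows(para, target, overlap))
            continue
        if current and len(current) + len(para) > target:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
        current.extend(para)
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]
```

Splitting on paragraph boundaries first keeps related sentences in the same chunk, which tends to help both retrieval scoring and the readability of cited passages.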
WAF-safe DELETE
POST alias /delete endpoint for environments where DELETE is blocked at the ingress or WAF layer.
Polite crawling
AtlasCrawler/1.0 User-Agent. 250ms inter-request delay. Skips non-200, non-HTML, and binary file extensions.
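The skip rules can be expressed as two small predicates. The extension list and both function names below are illustrative assumptions; only the User-Agent string, the 250ms delay, and the non-200, non-HTML, and binary-extension rules come from the profile.

```python
from posixpath import splitext
from urllib.parse import urlparse

USER_AGENT = "AtlasCrawler/1.0"
REQUEST_DELAY_SECONDS = 0.25  # pause between requests
# Assumed sample of binary extensions; the real list is not given.
SKIPPED_EXTENSIONS = {".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".mp4", ".exe"}


def should_fetch(url):
    """False for URLs whose path ends in a known binary extension."""
    extension = splitext(urlparse(url).path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS


def should_index(status_code, content_type):
    """True only for 200 responses whose media type is text/html."""
    media_type = content_type.split(";")[0].strip().lower()
    return status_code == 200 and media_type == "text/html"
```

A crawl loop would call `should_fetch` before each request, wait `REQUEST_DELAY_SECONDS` between requests, and call `should_index` on the response before parsing.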

Real-World Use Cases

03
Competitive Intelligence
Market Research & Competitor Analysis
Crawl competitor websites, industry blogs, and marketplace directories. Ask "What pricing strategies are competitors using in Saskatchewan?" and get cited answers from the actual pages — not hallucinations.
commercial
competitor
marketplace
blog
Legal & Regulatory
Compliance Knowledge Base
Index government portals, regulatory agencies, and income tax acts across jurisdictions. Ask natural-language compliance questions and get answers with direct citations to source pages. Originally built for Canadian tax.
government
regulatory
statistical
academic
Content Marketing / SEO
Content Gap & Keyword Research
Crawl competitor blogs and industry publications to find content gaps. Chat to surface topics competitors rank for that you don't cover, with exact URLs showing where their coverage is strongest.
industry
news
Real Estate / Local Business
Geo-Scoped Market Intelligence
Brief the crawler with a geography (e.g. "Regina, Saskatchewan") and data types. Atlas prioritizes local service providers, news outlets, and government data sources. Ask about trends, pricing, and market activity.
forum

REST API Surface

04
Methods: GET · POST · PATCH · DEL
Endpoints:
/api/projects
/api/projects/:id
/api/projects/:id/renew
/api/projects/:id/discover-sources
/api/projects/:id/accept-sources
/api/stats?project_id
/api/sources?project_id
/api/sources
/api/sources/:id
/api/sources/:id/scrape
/api/sources/scrape-all
/api/sources/:id/reset-coverage
/api/scrape-jobs
/api/scrape-jobs/active
/api/scrape-jobs/:id
/api/documents?project_id
/api/documents/:id
/api/chat
/api/index/rebuild
/api/maintenance/dedupe
/api/maintenance/purge-off-scope

Skills Demonstrated

05
Expert
FastAPI + Async Python
LLM Integration
MongoDB + Motor
Web Scraping / Crawling
Advanced
React 19 / shadcn
TF-IDF / RAG Systems
scikit-learn / NLP
API Design + Testing

Project Roadmap

06
✓ Shipped — Feb 2026
Core Crawl + RAG Engine (live)
BFS crawler, TF-IDF index, grounded chat with citations, per-project isolation, 90-day retention, JSON export.
✓ Shipped — Feb 15, 2026
Commercial & Competitor Source Types (live)
Added commercial, blog, marketplace, competitor, and web data types to discovery prompt and New Project UI.
Re-Discover Sources Dialog (live)
Admin panel dialog to trigger re-discovery with new data types + extra hints. Already-added URLs auto-skipped.
Next Phase
Web Search Source Discovery (planned)
Tavily / Brave / SerpAPI integration for real-time web search alongside LLM-based discovery. Requires an API key.
Content Diff Viewer (planned)
Side-by-side diff display when page content changes on re-crawl. Old vs new content with change highlighting.
Future
Scheduled Crawling (future)
Per-project cron-style scheduling. "Auto New Session weekly" for always-fresh knowledgebases.

Atlas Crawler
Topic-driven crawl, scrape & RAG engine. Built with FastAPI, React 19, MongoDB, and Claude.
Key Files
backend/server.py: All routes, crawler, RAG
frontend/src/lib/api.js: Axios API client
frontend/src/App.js: React router + pages
backend/tests/: pytest integration suite
memory/PRD.md: Product requirements
Data Types Supported
Government · Regulatory · Statistical
News · Industry · Association
Commercial · Blog · Marketplace
Competitor · Academic · Forum
Web (general)
