01
Platform Capabilities
- LLM-Powered Source Discovery
- Claude Sonnet generates 18–28 curated seed URLs from a plain-English project brief. Supports 13 source categories: government, regulatory, news, commercial, blog, marketplace, competitor, academic, forum, and more.
- Deep Async BFS Crawler
- Breadth-first scraping with configurable page depth (default 2) and page limits (default 50). Path-prefix locking keeps crawls on-topic. 15-minute runtime cap with persistent frontier for resume-on-next-run.
- Content-Hash Refresh Detection
- SHA-256 hashing detects when page content changes. Stale chunks are automatically deleted and re-indexed. Unchanged pages are skipped. Tracks new, refreshed, and unchanged counts per job.
- TF-IDF Retrieval + RAG Chat
- Per-project TF-IDF index with bigram support and cosine similarity scoring. Bonus scoring for query-term density. Top-K chunks injected into LLM context for grounded, citation-backed answers.
- Per-Project Isolation
- Each project's sources, documents, chunks, scrape jobs, and chat sessions are stored in MongoDB and scoped by project_id. Clean deletion cascade. 90-day auto-expiry with hourly background reaper.
- JSON Knowledgebase Export
- Full portable export of every project's documents and chunks as downloadable JSON. Import/export API with job tracking. Downloads use a browser-anchor pattern, so popup blockers do not interfere.
- Admin Maintenance Tooling
- Deduplication endpoint removes duplicate documents by source+URL key. Purge-off-scope cleans documents that fall outside current path-prefix rules. Manual index rebuild endpoint for ops teams.
- Multi-LLM Model Selection
- Per-chat model switching between Claude Sonnet 4.5, GPT-5.2, and Gemini 3 Pro. Powered by emergentintegrations LlmChat abstraction. Easily extensible to new providers.
- React 19 Frontend
- Full multi-page SPA: Projects dashboard, new-project wizard, per-project admin panel, and chat interface. Built with shadcn/ui, Tailwind CSS, lucide-react icons, and Sonner toast notifications.
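The content-hash refresh detection listed above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names, the in-memory `stored_hashes` dict (standing in for a database lookup), and the sample URLs are all assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_page(url: str, text: str, stored_hashes: dict) -> str:
    """Return 'new', 'refreshed', or 'unchanged' and update the stored hash."""
    digest = content_hash(text)
    previous = stored_hashes.get(url)
    if previous == digest:
        return "unchanged"  # nothing to re-chunk or re-index
    stored_hashes[url] = digest
    # For a changed page, stale chunks would be deleted here before re-indexing.
    return "new" if previous is None else "refreshed"


def run_job(pages, stored_hashes: dict) -> dict:
    """Tally per-job counts for a list of (url, text) pairs."""
    counts = {"new": 0, "refreshed": 0, "unchanged": 0}
    for url, text in pages:
        counts[classify_page(url, text, stored_hashes)] += 1
    return counts


stored = {}  # hypothetical stand-in for hashes persisted per document
first = run_job([("https://example.com/a", "v1"), ("https://example.com/b", "v1")], stored)
second = run_job([("https://example.com/a", "v1"), ("https://example.com/b", "v2")], stored)
print(first)
print(second)
```

On the second run only the page whose text changed is counted as refreshed; the other is skipped as unchanged.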
02
Technical Architecture
- Backend · Python
- FastAPI
- Async REST API with Pydantic v2 models. APIRouter with /api prefix, CORS middleware, background tasks via asyncio.
- Motor
- Async MongoDB driver. Per-collection queries scoped by project_id. Cascade deletes, count queries, cursor-based pagination.
- httpx
- Async HTTP client for the crawler. 20s timeout, follows redirects, and processes only responses with an HTML content type.
- BeautifulSoup4
- lxml parser. Strips nav/footer/script/aside noise. Extracts main/article content. Link extraction with tracking-param stripping.
- scikit-learn
- TfidfVectorizer with bigrams, stop-word removal, 50k feature cap. Cosine similarity for per-project chunk retrieval.
- Pydantic v2
- Strict request/response validation. ProjectBrief, Source, Document, Chunk, ScrapeJob, ChatRequest models.
- pytest
- Integration test suite for stats, sources, scrape lifecycle, index rebuild, search, and LLM chat endpoints.
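To make the TF-IDF retrieval step concrete, here is a standard-library stand-in for the scikit-learn setup described above (unigrams plus bigrams, stop-word removal, cosine similarity). It is a sketch under assumptions: the tiny stop-word list, the function names, and the sample chunks are invented for illustration, the 50k feature cap and the query-term density bonus are omitted, and the scores will not match `TfidfVectorizer` output exactly because tokenization differs.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}  # assumed, tiny list


def tokenize(text: str) -> list:
    """Lowercase words minus stop words, plus adjacent-word bigrams."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def l2_normalize(vec: dict) -> dict:
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec


def build_index(chunks: list):
    """Return (idf, vectors): smoothed IDF weights and one unit vector per chunk."""
    counts = [Counter(tokenize(c)) for c in chunks]
    n = len(chunks)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = [l2_normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def search(query: str, chunks: list, idf: dict, vectors: list, top_k: int = 3) -> list:
    """Rank chunks by cosine similarity; returns (chunk_index, score) pairs."""
    q_counts = Counter(tokenize(query))
    q = l2_normalize({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k] if scores[i] > 0]


chunks = [
    "Income tax rates in Saskatchewan",
    "Crawler politeness and robots rules",
    "Provincial sales tax rates",
]
idf, vectors = build_index(chunks)
results = search("saskatchewan income tax", chunks, idf, vectors)
print(results)
```

The top-ranked chunks returned by `search` are the ones that would be placed into the LLM context for a grounded answer.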
- Frontend · React / JS
- React 19
- Latest React with hooks. react-router-dom v7 for multi-page SPA. useState/useEffect for async data management.
- shadcn/ui
- Full Radix UI component set: dialogs, dropdowns, badges, tabs, progress, toasts, accordion, command palette, and more.
- Tailwind CSS
- Utility-first styling with custom design tokens. craco config for path aliasing. PostCSS + autoprefixer pipeline.
- axios
- Typed API client (api.js) wrapping all backend endpoints. Project-scoped methods, error handling, toast integration.
- lucide-react
- Icon library. Loader2 spinners, Sparkles for AI actions, ExternalLink, ChevronLeft for navigation context.
- sonner
- Toast notification system. Long-duration toasts for async discovery (60s), success/error feedback on all mutations.
- craco
- Custom CRA config for webpack overrides, path aliases (@/components), and the custom health-check plugin.
- Data · Infrastructure
- MongoDB
- 8 collections: projects, sources, documents, chunks, scrape_jobs, chat_sessions, messages, import_jobs. Motor async driver.
- Claude Sonnet
- Source discovery (structured JSON output) + RAG chat. emergentintegrations.LlmChat abstraction with session management.
- OpenAI GPT
- GPT-5.2 as switchable model option. Same LlmChat interface, per-session conversation history management.
- Gemini Pro
- Gemini 3 Pro as third model option. google-genai + google-generativeai SDK support in requirements.
- uvicorn
- ASGI server for FastAPI. Background task execution for crawl jobs and hourly reaper via asyncio.create_task.
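The background reaper mentioned above, started with `asyncio.create_task`, might look roughly like the following. This is a hedged sketch: the in-memory `projects` dict, the `reap_expired` and `reaper_loop` names, and measuring the 90 days from a creation timestamp are assumptions (the API also lists a renew endpoint, so real expiry may be tracked differently).

```python
import asyncio
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)


def reap_expired(projects: dict, now: datetime) -> list:
    """Delete projects older than RETENTION; returns the removed ids."""
    expired = [pid for pid, created in projects.items() if now - created >= RETENTION]
    for pid in expired:
        del projects[pid]  # a real reaper would cascade to sources, documents, chunks
    return expired


async def reaper_loop(projects: dict, interval_seconds: float, stop: asyncio.Event) -> None:
    """Run one pass, then sleep until the next interval or until stop is set."""
    while not stop.is_set():
        reap_expired(projects, datetime.now(timezone.utc))
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass  # interval elapsed, run the next pass


async def main() -> list:
    now = datetime.now(timezone.utc)
    projects = {"old": now - timedelta(days=91), "fresh": now - timedelta(days=1)}
    stop = asyncio.Event()
    task = asyncio.create_task(reaper_loop(projects, 3600, stop))  # hourly
    await asyncio.sleep(0.01)  # give the reaper time for its first pass
    stop.set()
    await task
    return sorted(projects)


remaining = asyncio.run(main())
print(remaining)
```

Waiting on the stop event rather than a bare sleep lets the loop shut down promptly when the server stops.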
- Developer Experience
- URL normalization
- Strips UTM/tracking params, normalizes port, trailing slash, and scheme. Prevents duplicate indexing of canonically identical URLs.
- Path-prefix scoping
- Per-source allowed_path_prefixes list. LLM discovery auto-suggests path scopes. User-editable via Admin panel.
- Chunking engine
- 700-word target chunks with 100-word overlap. Paragraph-aware splitting. Long paragraphs split with sliding window.
- WAF-safe DELETE
- POST alias /delete endpoint for environments where DELETE is blocked at the ingress or WAF layer.
- Polite crawling
- AtlasCrawler/1.0 User-Agent. 250ms inter-request delay. Skips non-200, non-HTML, and binary file extensions.
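URL normalization and path-prefix scoping, as described above, can be illustrated with the standard library. The sketch below makes several assumptions: the helper names, the extra tracking parameters beyond `utm_*`, treating scheme normalization as lowercasing, dropping fragments, and treating an empty prefix list as "no restriction" are illustrative choices, not confirmed behaviour.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"fbclid", "gclid"}  # assumed extras; utm_* is handled by prefix
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Canonicalize a URL so equivalent forms map to the same string."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is the default for the scheme.
    netloc = host if port is None or DEFAULT_PORTS.get(scheme) == port else f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_") and k.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))  # fragment dropped


def in_scope(url: str, allowed_path_prefixes: list) -> bool:
    """True if the URL path starts with any allowed prefix (empty list allows all)."""
    path = urlsplit(url).path or "/"
    return not allowed_path_prefixes or any(path.startswith(p) for p in allowed_path_prefixes)


canonical = normalize_url("HTTPS://Example.com:443/docs/?utm_source=x&page=2#top")
print(canonical)
print(in_scope(canonical, ["/docs"]), in_scope("https://example.com/blog", ["/docs"]))
```

Note that a plain `startswith` check also lets `/docs` match `/docsearch`; ending prefixes with a slash is one way to tighten that.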
03
Real-World Use Cases
- Competitive Intelligence
- Market Research & Competitor Analysis
- Crawl competitor websites, industry blogs, and marketplace directories. Ask "What pricing strategies are competitors using in Saskatchewan?" and get cited answers from the actual pages, not hallucinations.
- commercial
- competitor
- marketplace
- blog
- Legal & Regulatory
- Compliance Knowledge Base
- Index government portals, regulatory agencies, and income tax acts across jurisdictions. Ask natural-language compliance questions and get answers that cite the source pages directly. Originally built for Canadian tax.
- government
- regulatory
- statistical
- academic
- Content Marketing / SEO
- Content Gap & Keyword Research
- Crawl competitor blogs and industry publications to find content gaps. Chat to surface topics competitors rank on that you don't cover, with exact URLs showing where their coverage is strongest.
- industry
- news
- Real Estate / Local Business
- Geo-Scoped Market Intelligence
- Brief the crawler with a geography (e.g. "Regina, Saskatchewan") and data types. Atlas prioritizes local service providers, news outlets, and government data sources. Ask about trends, pricing, and market activity.
- forum
04
REST API Surface
- GET
- /api/projects
- POST
- /api/projects/:id
- DEL
- /api/projects/:id/renew
- /api/projects/:id/discover-sources
- /api/projects/:id/accept-sources
- /api/stats?project_id
- /api/sources?project_id
- /api/sources
- PATCH
- /api/sources/:id
- /api/sources/:id/scrape
- /api/sources/scrape-all
- /api/sources/:id/reset-coverage
- /api/scrape-jobs
- /api/scrape-jobs/active
- /api/scrape-jobs/:id
- /api/documents?project_id
- /api/documents/:id
- /api/chat
- /api/index/rebuild
- /api/maintenance/dedupe
- /api/maintenance/purge-off-scope
05
Skills Demonstrated
- Expert
- FastAPI + Async Python
- LLM Integration
- MongoDB + Motor
- Web Scraping / Crawling
- Advanced
- React 19 / shadcn
- TF-IDF / RAG Systems
- scikit-learn / NLP
- API Design + Testing
06
Project Roadmap
- Shipped · Feb 2026
- Core Crawl + RAG Engine (live)
- BFS crawler, TF-IDF index, grounded chat with citations, per-project isolation, 90-day retention, JSON export.
- Shipped · Feb 15, 2026
- Commercial & Competitor Source Types (live)
- Added commercial, blog, marketplace, competitor, and web data types to discovery prompt and New Project UI.
- Re-Discover Sources Dialog (live)
- Admin panel dialog to trigger re-discovery with new data types + extra hints. Already-added URLs auto-skipped.
- Next Phase
- Web Search Source Discovery (planned)
- Tavily / Brave / SerpAPI integration for real-time web search alongside LLM-based discovery. Needs API key.
- Content Diff Viewer (planned)
- Side-by-side diff display when page content changes on re-crawl. Old vs new content with change highlighting.
- Future
- Scheduled Crawling (future)
- Per-project cron-style scheduling. "Auto New Session weekly" for always-fresh knowledgebases.
07
Atlas Crawler
- Topic-driven crawl, scrape & RAG engine. Built with FastAPI, React 19, MongoDB, and Claude.
- Key Files
- backend/server.py → All routes, crawler, RAG
- frontend/src/lib/api.js → Axios API client
- frontend/src/App.js → React router + pages
- backend/tests/ → pytest integration suite
- memory/PRD.md → Product requirements
- Data Types Supported
- Government · Regulatory · Statistical
- News · Industry · Association
- Commercial · Blog · Marketplace
- Competitor · Academic · Forum
- Web (general)