Project Profile

Atlas Crawler

Topic-driven crawl, scrape, and RAG platform. A full-stack research system for source discovery, deep crawling, content extraction, indexing, and grounded AI chat.

Platform Capabilities

  • LLM-Powered Source Discovery
  • Claude Sonnet generates 18–28 curated seed URLs from a plain-English project brief. Supports 13 source categories: government, regulatory, news, commercial, blog, marketplace, competitor, academic, forum, and more.
  • Deep Async BFS Crawler
  • Breadth-first scraping with configurable page depth (default 2) and page limits (default 50). Path-prefix locking keeps crawls on-topic. 15-minute runtime cap with persistent frontier for resume-on-next-run.
  • Content-Hash Refresh Detection
  • SHA-256 hashing detects when page content changes. Stale chunks are automatically deleted and re-indexed. Unchanged pages are skipped. Tracks new, refreshed, and unchanged counts per job.
  • TF-IDF Retrieval + RAG Chat
  • Per-project TF-IDF index with bigram support and cosine similarity scoring. Bonus scoring for query-term density. Top-K chunks injected into LLM context for grounded, citation-backed answers.
  • Per-Project Isolation
  • Each project's sources, documents, chunks, scrape jobs, and chat sessions are isolated by project_id in MongoDB. Deleting a project cascades cleanly to all of its data. Projects auto-expire after 90 days, enforced by an hourly background reaper.
  • JSON Knowledgebase Export
  • Full portable export of every project's documents and chunks as downloadable JSON. Import/export API with job tracking. Downloads are triggered through a browser anchor element, which avoids popup blockers.
  • Admin Maintenance Tooling
  • Deduplication endpoint removes duplicate documents by source+URL key. Purge-off-scope cleans documents that fall outside current path-prefix rules. Manual index rebuild endpoint for ops teams.
  • Multi-LLM Model Selection
  • Per-chat model switching between Claude Sonnet 4.5, GPT-5.2, and Gemini 3 Pro. Powered by emergentintegrations LlmChat abstraction. Easily extensible to new providers.
  • React 19 Frontend
  • Full multi-page SPA: Projects dashboard, new-project wizard, per-project admin panel, and chat interface. Built with shadcn/ui, Tailwind CSS, lucide-react icons, and Sonner toast notifications.
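
The breadth-first crawl with depth and page limits and path-prefix locking, as listed above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `bfs_crawl`, `in_scope`, and the `fetch_links` callback are hypothetical names, and an in-memory link map stands in for real HTTP fetching so the example runs offline. The leftover frontier is returned to suggest how a later run might resume.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, allowed_prefixes):
    """True if the URL path starts with one of the allowed path prefixes."""
    path = urlparse(url).path or "/"
    return any(path.startswith(prefix) for prefix in allowed_prefixes)


def bfs_crawl(seed, fetch_links, allowed_prefixes, max_depth=2, max_pages=50):
    """Visit pages breadth-first; return (visited URLs, pending frontier)."""
    visited = []
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links below the depth limit
        for link in fetch_links(url):
            if link not in seen and in_scope(link, allowed_prefixes):
                seen.add(link)
                frontier.append((link, depth + 1))
    # Whatever is still queued could be persisted and resumed on the next run.
    return visited, list(frontier)


# Hypothetical link graph used instead of network requests.
SEED = "https://example.com/docs/"
LINKS = {
    SEED: [
        "https://example.com/docs/a",
        "https://example.com/blog/x",  # outside the /docs/ prefix, skipped
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": ["https://example.com/docs/too-deep"],
}

visited, pending = bfs_crawl(SEED, lambda u: LINKS.get(u, []), ["/docs/"])
print(visited)
```

Calling the same function with `max_pages=2` stops after two pages and returns the unvisited URLs as the pending frontier.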

Technical Architecture

  • Backend · Python
  • FastAPI
  • Async REST API with Pydantic v2 models. APIRouter with /api prefix, CORS middleware, background tasks via asyncio.
  • Motor
  • Async MongoDB driver. Per-collection queries scoped by project_id. Cascade deletes, count queries, cursor-based pagination.
  • httpx
  • Async HTTP client for the crawler. 20s timeout, follows redirects, and only processes responses with an HTML content type.
  • BeautifulSoup4
  • lxml parser. Strips nav/footer/script/aside noise. Extracts main/article content. Link extraction with tracking-param stripping.
  • scikit-learn
  • TfidfVectorizer with bigrams, stop-word removal, 50k feature cap. Cosine similarity for per-project chunk retrieval.
  • Pydantic v2
  • Strict request/response validation. ProjectBrief, Source, Document, Chunk, ScrapeJob, ChatRequest models.
  • pytest
  • Integration test suite for stats, sources, scrape lifecycle, index rebuild, search, and LLM chat endpoints.
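
The retrieval step named in the scikit-learn entry above (TF-IDF over unigrams and bigrams, ranked by cosine similarity) can be illustrated without dependencies. The project itself is described as using scikit-learn's TfidfVectorizer; the sketch below is a plain-Python stand-in whose function names (`terms`, `build_index`, `top_k`) are hypothetical. It omits stop-word removal, the 50k feature cap, and the query-term density bonus, so its scores are not meant to match the real index.

```python
import math
import re
from collections import Counter


def terms(text):
    """Lowercase word unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def build_index(chunks):
    """Return (idf, vectors); each vector maps term -> TF-IDF weight."""
    counts = [Counter(terms(chunk)) for chunk in chunks]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(chunks)
    # Smoothed IDF, so a term found in every chunk keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return up to k (score, chunk) pairs with a non-zero score, best first."""
    idf, vectors = build_index(chunks)
    q = {t: tf * idf[t] for t, tf in Counter(terms(query)).items() if t in idf}
    scored = sorted(
        ((cosine(q, v), c) for v, c in zip(vectors, chunks)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(score, chunk) for score, chunk in scored[:k] if score > 0]


chunks = [
    "Income tax rates for Saskatchewan residents",
    "Crawler politeness and request delays",
    "Provincial sales tax rules in Saskatchewan",
]
results = top_k("income tax rates", chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

In a RAG setup the returned chunks would then be placed into the LLM prompt along with their source URLs.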
  • Frontend · React / JS
  • React 19
  • Latest React with hooks. react-router-dom v7 for multi-page SPA. useState/useEffect for async data management.
  • shadcn/ui
  • Full Radix UI component set: dialogs, dropdowns, badges, tabs, progress, toasts, accordion, command palette, and more.
  • Tailwind CSS
  • Utility-first styling with custom design tokens. craco config for path aliasing. PostCSS + autoprefixer pipeline.
  • axios
  • Typed API client (api.js) wrapping all backend endpoints. Project-scoped methods, error handling, toast integration.
  • lucide-react
  • Icon library. Loader2 spinners, Sparkles for AI actions, ExternalLink, ChevronLeft for navigation context.
  • sonner
  • Toast notification system. Long-duration toasts for async discovery (60s), success/error feedback on all mutations.
  • craco
  • Custom CRA config for webpack overrides, path aliases (@/components), and the custom health-check plugin.
  • Data · Infrastructure
  • MongoDB
  • 8 collections: projects, sources, documents, chunks, scrape_jobs, chat_sessions, messages, import_jobs. Motor async driver.
  • Claude Sonnet
  • Source discovery (structured JSON output) + RAG chat. emergentintegrations.LlmChat abstraction with session management.
  • OpenAI GPT
  • GPT-5.2 as switchable model option. Same LlmChat interface, per-session conversation history management.
  • Gemini Pro
  • Gemini 3 Pro as third model option. google-genai + google-generativeai SDK support in requirements.
  • uvicorn
  • ASGI server for FastAPI. Background task execution for crawl jobs and hourly reaper via asyncio.create_task.
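
The hourly reaper started with asyncio.create_task, mentioned in the uvicorn entry above, might look roughly like this. It is a sketch under assumptions: an in-memory dict stands in for the projects collection, `renewed_at` is a hypothetical timestamp field, and the interval is shortened so the example finishes quickly.

```python
import asyncio
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)


def expired_ids(projects, now):
    """IDs of projects whose last renewal is older than the retention window."""
    return [pid for pid, renewed_at in projects.items() if now - renewed_at > RETENTION]


async def reaper_loop(projects, interval_s=3600.0):
    """Delete expired projects every `interval_s` seconds until cancelled."""
    while True:
        now = datetime.now(timezone.utc)
        for pid in expired_ids(projects, now):
            # A real reaper would also cascade-delete the project's
            # sources, documents, chunks, jobs, and chat sessions.
            del projects[pid]
        await asyncio.sleep(interval_s)


async def main():
    now = datetime.now(timezone.utc)
    projects = {
        "fresh": now - timedelta(days=10),
        "stale": now - timedelta(days=91),
    }
    # Run the reaper in the background, as an app startup hook might.
    task = asyncio.create_task(reaper_loop(projects, interval_s=0.01))
    await asyncio.sleep(0.05)  # give the reaper time to run at least once
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return projects


remaining = asyncio.run(main())
print(sorted(remaining))
```

Keeping a reference to the task, as `main` does here, matters in practice because the event loop only holds weak references to tasks.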
  • Developer Experience
  • URL normalization
  • Strips UTM/tracking params, normalizes port, trailing slash, and scheme. Prevents duplicate indexing of canonically identical URLs.
  • Path-prefix scoping
  • Per-source allowed_path_prefixes list. LLM discovery auto-suggests path scopes. User-editable via Admin panel.
  • Chunking engine
  • 700-word target chunks with 100-word overlap. Paragraph-aware splitting. Long paragraphs split with sliding window.
  • WAF-safe DELETE
  • POST alias /delete endpoint for environments where DELETE is blocked at the ingress or WAF layer.
  • Polite crawling
  • AtlasCrawler/1.0 User-Agent. 250ms inter-request delay. Skips non-200, non-HTML, and binary file extensions.
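
The chunking rules listed above (700-word target, 100-word overlap, paragraph-aware splitting, sliding window for long paragraphs) can be sketched as below. This is an illustrative reading of those rules rather than the project's code: `chunk_text` and `sliding_window` are hypothetical names, paragraphs are assumed to be separated by blank lines, and words are split on whitespace. Because overlap words are carried into the next chunk, a chunk may exceed the target by up to the overlap size.

```python
def sliding_window(words, target, overlap):
    """Split a long word list into windows of `target` words sharing `overlap` words."""
    step = target - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + target])
        if start + target >= len(words):
            break  # this window already reached the end of the paragraph
    return windows


def chunk_text(text, target=700, overlap=100):
    """Group paragraphs into chunks of roughly `target` words with overlap."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    chunks = []
    current = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        if len(words) > target:
            # Oversized paragraph: flush what we have, then window it.
            if current:
                chunks.append(current)
                current = []
            chunks.extend(sliding_window(words, target, overlap))
            continue
        if current and len(current) + len(words) > target:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


# One 1500-word paragraph is split by the sliding window.
long_para = " ".join(f"w{i}" for i in range(1500))
sizes = [len(c.split()) for c in chunk_text(long_para)]
print(sizes)

# Short paragraphs are grouped, with the last two words carried over.
small = chunk_text("a b c\n\nd e f\n\ng h i", target=6, overlap=2)
print(small)
```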

Real-World Use Cases

  • Competitive Intelligence
  • Market Research & Competitor Analysis
  • Crawl competitor websites, industry blogs, and marketplace directories. Ask "What pricing strategies are competitors using in Saskatchewan?" and get cited answers drawn from the actual pages, not hallucinations.
  • commercial
  • competitor
  • marketplace
  • blog
  • Legal & Regulatory
  • Compliance Knowledge Base
  • Index government portals, regulatory agencies, and income tax acts across jurisdictions. Ask natural-language compliance questions and get answers with direct citations to source pages. Originally built for Canadian tax.
  • government
  • regulatory
  • statistical
  • academic
  • Content Marketing / SEO
  • Content Gap & Keyword Research
  • Crawl competitor blogs and industry publications to find content gaps. Chat to surface topics competitors rank on that you don't cover, with exact URLs showing where their coverage is strongest.
  • industry
  • news
  • Real Estate / Local Business
  • Geo-Scoped Market Intelligence
  • Brief the crawler with a geography (e.g. "Regina, Saskatchewan") and data types. Atlas prioritizes local service providers, news outlets, and government data sources. Ask about trends, pricing, and market activity.
  • forum

REST API Surface

  • GET
  • /api/projects
  • POST
  • /api/projects/:id
  • DEL
  • /api/projects/:id/renew
  • /api/projects/:id/discover-sources
  • /api/projects/:id/accept-sources
  • /api/stats?project_id
  • /api/sources?project_id
  • /api/sources
  • PATCH
  • /api/sources/:id
  • /api/sources/:id/scrape
  • /api/sources/scrape-all
  • /api/sources/:id/reset-coverage
  • /api/scrape-jobs
  • /api/scrape-jobs/active
  • /api/scrape-jobs/:id
  • /api/documents?project_id
  • /api/documents/:id
  • /api/chat
  • /api/index/rebuild
  • /api/maintenance/dedupe
  • /api/maintenance/purge-off-scope
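
As a rough idea of what the dedupe maintenance route could do, the sketch below combines URL normalization (tracking parameters, default ports, trailing slash, scheme and host case) with a source+URL key. Everything here is assumed for illustration: the `normalize_url` and `dedupe` names, the `source_id` and `url` field names, the list of tracking keys, and the choice to keep the first document seen. Scheme handling is limited to lowercasing.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}  # assumed examples of tracking params
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of the URL for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode([
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ])
    return urlunsplit((scheme, host, path, query, ""))  # fragment is dropped


def dedupe(documents):
    """Keep the first document per (source_id, normalized URL); return (kept, removed count)."""
    seen = set()
    kept = []
    for doc in documents:
        key = (doc["source_id"], normalize_url(doc["url"]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept, len(documents) - len(kept)


docs = [
    {"source_id": "s1", "url": "https://example.com/docs/"},
    {"source_id": "s1", "url": "https://EXAMPLE.com/docs?utm_campaign=a"},
    {"source_id": "s2", "url": "https://example.com/docs"},
]
kept, removed = dedupe(docs)
print(len(kept), removed)
```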

Skills Demonstrated

  • Expert
  • FastAPI + Async Python
  • LLM Integration
  • MongoDB + Motor
  • Web Scraping / Crawling
  • Advanced
  • React 19 / shadcn
  • TF-IDF / RAG Systems
  • scikit-learn / NLP
  • API Design + Testing

Project Roadmap

  • ✓ Shipped: Feb 2026
  • Core Crawl + RAG Engine (live)
  • BFS crawler, TF-IDF index, grounded chat with citations, per-project isolation, 90-day retention, JSON export.
  • ✓ Shipped: Feb 15, 2026
  • Commercial & Competitor Source Types (live)
  • Added commercial, blog, marketplace, competitor, and web data types to discovery prompt and New Project UI.
  • Re-Discover Sources Dialog (live)
  • Admin panel dialog to trigger re-discovery with new data types + extra hints. Already-added URLs auto-skipped.
  • Next Phase
  • Web Search Source Discovery (planned)
  • Tavily / Brave / SerpAPI integration for real-time web search alongside LLM-based discovery. Needs API key.
  • Content Diff Viewer (planned)
  • Side-by-side diff display when page content changes on re-crawl. Old vs new content with change highlighting.
  • Future
  • Scheduled Crawling (future)
  • Per-project cron-style scheduling. "Auto New Session weekly" for always-fresh knowledgebases.

Atlas Crawler

  • Topic-driven crawl, scrape & RAG engine. Built with FastAPI, React 19, MongoDB, and Claude.
  • Key Files
  • backend/server.py: all routes, crawler, RAG
  • frontend/src/lib/api.js: Axios API client
  • frontend/src/App.js: React router + pages
  • backend/tests/: pytest integration suite
  • memory/PRD.md: product requirements
  • Data Types Supported
  • Government · Regulatory · Statistical
  • News · Industry · Association
  • Commercial · Blog · Marketplace
  • Competitor · Academic · Forum
  • Web (general)