Crawler Design & Configuration¶
Web crawler for endpoint discovery and vulnerability testing.
Crawler Philosophy¶
The crawler is conservative, safe, and ethical:
- Prevent DoS — Rate limiting keeps load on the target server low
- Respect policies — Honors robots.txt directives
- Avoid legal issues — Conservative defaults keep crawls within expected scope
- Enable quick testing — Shallow crawls finish quickly
Default Configuration¶
```javascript
const DEFAULT_CRAWLER_OPTIONS = {
  maxDepth: 2,               // Max 2 link levels deep
  maxPages: 50,              // Stop after 50 pages
  rateLimit: 1000,           // 1 second between requests (ms)
  respectRobotsTxt: true,    // Check robots.txt
  allowExternalLinks: false, // Same-origin only
  timeout: 10000             // 10-second per-request timeout (ms)
};
```
Crawl Strategy¶
Breadth-First Traversal¶
1. Start at the root URL (depth 0)
2. Extract all links from the page
3. Add new URLs to the queue with depth + 1
4. Continue until maxDepth or maxPages is reached
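The steps above can be sketched as a FIFO-queue loop. This is an illustrative sketch, not the project's implementation; `extractLinks` is a hypothetical helper that returns the same-origin URLs found on a page.

```javascript
// Breadth-first crawl: a FIFO queue visits all depth-N pages before depth-N+1.
// extractLinks(url) is a hypothetical async helper returning an array of URLs.
async function crawlBfs(rootUrl, { maxDepth = 2, maxPages = 50 } = {}, extractLinks) {
  const visited = new Set([rootUrl]);
  const queue = [{ url: rootUrl, depth: 0 }]; // FIFO queue → breadth-first order
  const discovered = [];

  while (queue.length > 0 && discovered.length < maxPages) {
    const { url, depth } = queue.shift();
    discovered.push(url);
    if (depth >= maxDepth) continue; // don't expand links beyond maxDepth
    for (const link of await extractLinks(url)) {
      if (!visited.has(link)) { // skip URLs already queued or visited
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return discovered;
}
```

The `visited` set guarantees each URL is queued at most once, so cyclic links cannot loop the crawler.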
Discovery Methods¶
| Method | Example |
|---|---|
| HTML links | `<a href="/page">` → `/page` |
| Form actions | `<form action="/submit">` |
| AJAX calls | `fetch('/api/data')` → `/api/data` |
| Redirects | 301/302 to `/new` → `/new` |
| Sitemaps | `sitemap.xml` entries |
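The HTML-based discovery methods in the table can be approximated with pattern matching. This is a rough sketch (a real crawler would use a proper HTML parser); `discoverUrls` is an illustrative name, not the project's API.

```javascript
// Scan page source for <a href>, <form action>, and fetch() targets,
// resolve relative URLs against the page URL, and keep same-origin results only.
function discoverUrls(html, pageUrl) {
  const origin = new URL(pageUrl).origin;
  const found = new Set();
  const patterns = [
    /<a\s[^>]*href="([^"]+)"/gi,       // HTML links
    /<form\s[^>]*action="([^"]+)"/gi,  // form actions
    /fetch\(\s*['"]([^'"]+)['"]/g      // AJAX calls
  ];
  for (const re of patterns) {
    for (const m of html.matchAll(re)) {
      try {
        const u = new URL(m[1], pageUrl);            // resolve relative URLs
        if (u.origin === origin) found.add(u.pathname); // same-origin only
      } catch { /* ignore malformed URLs */ }
    }
  }
  return [...found];
}
```

Redirect and sitemap discovery happen at fetch time rather than by scanning page source, so they are not covered by this sketch.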
Rate Limiting¶
Prevents overwhelming the target server.
Benefit: Respects server resources and avoids being mistaken for a DoS attack
Trade-off: 50 pages at the default 1-second delay take ~50 seconds minimum
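The rate limit amounts to a fixed pause between sequential requests. A minimal sketch, assuming `rateLimit` is in milliseconds; `fetchWithRateLimit` and `sleep` are illustrative names, not the project's API.

```javascript
// Resolve after ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing rateLimit ms between consecutive requests.
async function fetchWithRateLimit(urls, rateLimit = 1000, fetchFn = fetch) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(rateLimit); // throttle every request after the first
    results.push(await fetchFn(urls[i]));
  }
  return results;
}
```

Because requests are sequential, total crawl time is at least `(pages - 1) × rateLimit`, which is where the "~50 seconds for 50 pages" figure comes from.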
Robots.txt Compliance¶
Respects website owner preferences. The crawler will:
- ✅ Skip `/admin/` and `/private/`
- ✅ Wait 1 second between requests
- ✅ Check `sitemap.xml` for URLs
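The skip rules above correspond to a robots.txt along these lines (a hypothetical example, not taken from a real site):

```text
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
```

`Disallow` paths are never fetched, and any `Sitemap` entry is used as an extra source of URLs to queue.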
Configuration Example¶
```javascript
const crawler = new WebCrawler({
  targetUrl: 'https://example.com',
  maxDepth: 3,    // Deeper crawl
  maxPages: 100,  // More pages
  rateLimit: 500, // Faster (2 req/sec)
  timeout: 15000  // Longer timeout
});

const results = await crawler.crawl();
```
Real-Time Logging¶
Progress updates via Server-Sent Events:

```text
• 10:30:51 [DYNAMIC] Starting URL crawl...
• 10:30:52 [DYNAMIC] Discovered /products
• 10:30:53 [DYNAMIC] Discovered /api/users
• 10:30:54 [DYNAMIC] Discovered /contact
✓ 10:30:55 [DYNAMIC] Crawl completed. Found 20 URLs
```
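Each log line above travels as one SSE frame. The `data:` prefix and blank-line terminator are part of the SSE protocol; the JSON payload shape and `formatSseEvent` name are assumptions for illustration.

```javascript
// Format one crawler log entry as a Server-Sent Events frame.
// An SSE frame is one or more "data: <payload>" lines ended by a blank line.
function formatSseEvent(phase, message, time = new Date()) {
  const timestamp = time.toTimeString().slice(0, 8); // HH:MM:SS
  const payload = JSON.stringify({ timestamp, phase, message });
  return `data: ${payload}\n\n`;
}
```

A server would write each frame to a response opened with `Content-Type: text/event-stream`, and the browser's `EventSource` API delivers the payload to the page.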
Next Steps¶
- Dynamic Testing — How tests run
- Architecture — System design