
Crawler Design & Configuration

Web crawler for endpoint discovery and vulnerability testing.


Crawler Philosophy

The crawler is conservative, safe, and ethical:

  1. Prevent DoS — Rate limiting respects server resources
  2. Respect policies — Honors robots.txt
  3. Avoid legal issues — Safe defaults
  4. Enable quick testing — Shallow crawls are fast

Default Configuration

const DEFAULT_CRAWLER_OPTIONS = {
  maxDepth: 2,              // Max 2 link levels deep
  maxPages: 50,             // Stop after 50 pages
  rateLimit: 1000,          // 1 second between requests
  respectRobotsTxt: true,   // Check robots.txt
  allowExternalLinks: false, // Same-origin only
  timeout: 10000            // 10-second per-request timeout
};
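Caller-supplied options are merged over these defaults. A minimal sketch of that pattern, assuming a plain object merge (the resolveOptions helper is hypothetical, not part of the crawler's actual API):

// Hypothetical sketch: merge caller options over the defaults shown above
function resolveOptions(userOptions = {}) {
  return { ...DEFAULT_CRAWLER_OPTIONS, ...userOptions };
}

const options = resolveOptions({ maxDepth: 3 });
// options.maxDepth === 3; every other field keeps its default value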

Crawl Strategy

Breadth-First Traversal

  1. Start at root URL (depth 0)
  2. Extract all links from page
  3. Add new URLs to queue with depth + 1
  4. Continue until maxDepth or maxPages is reached (see the sketch below)
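A minimal sketch of this breadth-first loop, assuming helper functions fetchPage and extractLinks that are not part of the documented API:

// Hypothetical sketch of breadth-first crawling with depth and page limits
async function crawlBfs(rootUrl, { maxDepth, maxPages }) {
  const queue = [{ url: rootUrl, depth: 0 }];
  const visited = new Set();

  while (queue.length > 0 && visited.size < maxPages) {
    const { url, depth } = queue.shift();          // FIFO queue => breadth-first
    if (visited.has(url) || depth > maxDepth) continue;
    visited.add(url);

    const html = await fetchPage(url);             // assumed helper
    for (const link of extractLinks(html, url)) {  // assumed helper
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
  return [...visited];
}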

Discovery Methods

Method         Example
HTML links     <a href="/page">
Form actions   <form action="/submit">
AJAX calls     fetch('/api/data')
Redirects      301/302 to /new
Sitemaps       sitemap.xml entries
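A minimal sketch of the first two discovery methods, pulling <a href> and <form action> targets out of raw HTML with regular expressions (a production crawler would more likely use a DOM parser; the extractLinks function is an assumption):

// Hypothetical sketch: extract href/action values from raw HTML
function extractLinks(html, baseUrl) {
  const urls = new Set();
  const patterns = [
    /<a\s[^>]*href=["']([^"']+)["']/gi,      // HTML links
    /<form\s[^>]*action=["']([^"']+)["']/gi  // form actions
  ];
  for (const pattern of patterns) {
    for (const match of html.matchAll(pattern)) {
      urls.add(new URL(match[1], baseUrl).href); // resolve relative URLs
    }
  }
  return [...urls];
}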

Rate Limiting

Prevents overwhelming the target:

Time:     0ms            1000ms          2000ms
Request:  [ 1 ]          [ 2 ]           [ 3 ]
          <---1 sec----> <---1 sec---->

Benefit: Respects server resources and avoids triggering DoS protections

Trade-off: 50 pages = ~50 seconds minimum
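A minimal sketch of enforcing this delay, assuming sequential requests and a hypothetical fetchWithRateLimit helper:

// Hypothetical sketch: wait rateLimit milliseconds between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRateLimit(urls, rateLimit = 1000) {
  const responses = [];
  for (const url of urls) {
    responses.push(await fetch(url)); // one request at a time
    await sleep(rateLimit);           // 1000 ms => ~1 request per second
  }
  return responses;
}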


Robots.txt Compliance

Respects website owner preferences:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml

The crawler will:

- ✅ Skip /admin/ and /private/
- ✅ Wait 1 second between requests
- ✅ Check sitemap.xml for URLs
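A minimal sketch of the Disallow check, using simple prefix matching (wildcard rules and per-agent groups are ignored here, and both function names are assumptions):

// Hypothetical sketch: collect Disallow prefixes and test a URL path against them
function parseDisallowRules(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter((rule) => rule.length > 0);
}

function isAllowed(path, rules) {
  return !rules.some((rule) => path.startsWith(rule));
}

// isAllowed('/admin/users', parseDisallowRules(robotsTxt)) => false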


Configuration Example

const crawler = new WebCrawler({
  targetUrl: 'https://example.com',
  maxDepth: 3,          // Deeper crawl
  maxPages: 100,        // More pages
  rateLimit: 500,       // Faster (2 req/sec)
  timeout: 15000        // Longer timeout
});

const results = await crawler.crawl();

Real-Time Logging

Progress updates via Server-Sent Events:

• 10:30:51 [DYNAMIC] Starting URL crawl...
• 10:30:52 [DYNAMIC] Discovered /products
• 10:30:53 [DYNAMIC] Discovered /api/users
• 10:30:54 [DYNAMIC] Discovered /contact
✓ 10:30:55 [DYNAMIC] Crawl completed. Found 20 URLs
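A minimal sketch of consuming these updates in the browser with EventSource (the endpoint path and the payload fields are assumptions):

// Hypothetical sketch: subscribe to crawl progress over Server-Sent Events
const source = new EventSource('/api/crawl/events'); // assumed endpoint

source.onmessage = (event) => {
  const log = JSON.parse(event.data);                 // assumed JSON payload
  console.log(`${log.timestamp} [${log.phase}] ${log.message}`);
};

source.onerror = () => source.close();                // stop on connection loss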

Next Steps