SpiderNix Advanced Features
Version 0.2.0 introduces enterprise-grade features for advanced crawling, monitoring, and UX improvements.
Table of Contents
- Adaptive Rate Limiting
- Circuit Breaker Pattern
- Request Deduplication
- Real-time Monitoring
- Smart Link Prioritization
- Configuration Presets
- Interactive Wizard
- HTML Report Generation
- CAPTCHA Detection
- Session Management
- CLI Commands
Adaptive Rate Limiting
Automatically adjusts request rate based on server responses to avoid blocks and optimize crawl speed.
Features
- Backpressure Detection: Detects when server is under load (429, 503 responses, slow response times)
- Dynamic Adjustment: Increases delay when blocked, decreases when server responds well
- Configurable Thresholds: Customize min/max delays and adjustment factors
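The adjustment loop described in the list above can be sketched in a few lines. This is only an illustration of the idea, not SpiderNix's implementation: the class name `AdaptiveDelay`, its field names, and the default factors and thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Illustrative delay controller; names and defaults are hypothetical."""

    min_delay_ms: float = 100.0
    max_delay_ms: float = 10_000.0
    increase_factor: float = 2.0      # multiply the delay on backpressure
    decrease_factor: float = 0.9      # shrink the delay after a healthy response
    slow_threshold_ms: float = 2_000.0
    current_delay_ms: float = 500.0

    def is_backpressure(self, status_code: int, elapsed_ms: float) -> bool:
        # 429/503 or a slow response is treated as a sign of server load.
        return status_code in (429, 503) or elapsed_ms > self.slow_threshold_ms

    def record(self, status_code: int, elapsed_ms: float) -> float:
        """Update and return the delay to wait before the next request."""
        if self.is_backpressure(status_code, elapsed_ms):
            factor = self.increase_factor
        else:
            factor = self.decrease_factor
        # Clamp the new delay to the configured bounds.
        self.current_delay_ms = min(
            self.max_delay_ms,
            max(self.min_delay_ms, self.current_delay_ms * factor),
        )
        return self.current_delay_ms


limiter = AdaptiveDelay()
after_block = limiter.record(429, 120.0)  # backpressure: the delay grows
after_ok = limiter.record(200, 120.0)     # healthy response: the delay shrinks
```

Multiplicative increase with a gentler decrease backs off quickly when the server pushes back and recovers speed gradually, which is one common way to realize "increase when blocked, decrease when healthy".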
Usage
from spider_nix import SpiderNix, CrawlerConfig
config = CrawlerConfig()
crawler = SpiderNix(
    config=config,
    enable_adaptive_rate_limiting=True,  # Enable adaptive rate limiting
)
# `await` must run inside an async function (see Complete Example)
results = await crawler.crawl("https://example.com", max_pages=100)
# Check rate limiter stats
if crawler.rate_limiter:
    stats = crawler.rate_limiter.get_stats()
    print(f"Current delay: {stats.current_delay_ms}ms")
    print(f"Backpressure detected: {stats.backpressure_detected}")
CLI Usage
# Adaptive rate limiting is enabled by default in advanced-crawl
spider-nix advanced-crawl https://example.com --pages 100
Circuit Breaker Pattern
Prevents cascading failures by temporarily stopping requests to failing servers.
Features
- 3 States: Closed (normal), Open (failing), Half-Open (testing recovery)
- Automatic Recovery: Tests server health after timeout
- Configurable Thresholds: Customize failure counts and timeout periods
Usage
from spider_nix import SpiderNix, CircuitBreakerConfig
# Configure circuit breaker
cb_config = CircuitBreakerConfig(
    failure_threshold=5,  # Open after 5 failures
    success_threshold=2,  # Close after 2 successes in half-open
    timeout_seconds=60.0,  # Wait 60s before testing recovery
)
# NOTE: cb_config is not passed to the crawler in this snippet; check the
# API reference for the argument that accepts it
crawler = SpiderNix(
    enable_circuit_breaker=True,
)
# Circuit breaker automatically protects requests
results = await crawler.crawl("https://example.com")
# Check state
if crawler.circuit_breaker:
    print(f"Circuit state: {crawler.circuit_breaker.get_state()}")
States
- CLOSED: Normal operation, all requests allowed
- OPEN: Too many failures, requests rejected immediately
- HALF_OPEN: Testing recovery with limited requests
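To make the transitions between these states concrete, here is a minimal, self-contained sketch of a three-state breaker. It is not SpiderNix's code: `SimpleCircuitBreaker`, its methods, and the injectable `clock` parameter are hypothetical names chosen for the example, and the parameters only mirror those shown for `CircuitBreakerConfig`.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SimpleCircuitBreaker:
    """Illustrative three-state breaker; not SpiderNix's implementation."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the timeout can be tested
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # After the timeout, let a trial request through (half-open).
            if self.clock() - self.opened_at >= self.timeout_seconds:
                self.state = State.HALF_OPEN
                self.successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success resets the failure streak

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._open()  # recovery test failed, reopen immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
```

A caller would check `allow_request()` before each request and report the outcome with `record_success()` or `record_failure()` afterwards.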
Request Deduplication
Prevents crawling duplicate URLs and content, saving bandwidth and time.
Features
- URL Normalization: Sorts query params, removes fragments, normalizes domains
- Content Hashing: Detects duplicate content even with different URLs
- TTL Cache: Automatically expires old entries
- Memory Efficient: Limits cache size
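Content hashing combined with a TTL and a size limit can be sketched as follows. This is an assumption-laden illustration rather than the library's `RequestDeduplicator`: the class `ContentDeduplicator`, the choice of SHA-256, and the eviction policy (oldest entry first) are all picked for the example.

```python
import hashlib
import time
from collections import OrderedDict


class ContentDeduplicator:
    """Illustrative TTL and size-bounded content cache (hypothetical)."""

    def __init__(self, ttl_seconds=3600.0, max_entries=10_000,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._seen = OrderedDict()  # content hash -> time first seen

    def is_duplicate(self, body: str) -> bool:
        now = self.clock()
        # Drop expired entries; the oldest entries sit at the front.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.ttl_seconds:
                break
            self._seen.popitem(last=False)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return True
        self._seen[digest] = now
        # Enforce the size limit by evicting the oldest entry.
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)
        return False
```

Hashing the body rather than the URL is what lets two different URLs that serve identical content be recognized as duplicates.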
Usage
from spider_nix import SpiderNix
crawler = SpiderNix(
    enable_deduplication=True,  # Enable deduplication
)
results = await crawler.crawl("https://example.com", follow_links=True)
# Check deduplication stats
if crawler.deduplicator:
    stats = crawler.deduplicator.get_stats()
    print(f"URLs cached: {stats['url_cache_size']}")
    print(f"Content cached: {stats['content_cache_size']}")
URL Normalization Example
from spider_nix import RequestDeduplicator
dedup = RequestDeduplicator()
# These are considered the same:
url1 = "https://example.com/page?b=2&a=1#section"
url2 = "https://example.com/page?a=1&b=2"
normalized1 = dedup.normalize_url(url1)
normalized2 = dedup.normalize_url(url2)
# Both become: "https://example.com/page?a=1&b=2"
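For readers who want to see what such a normalization involves, the standard library is enough to reproduce the result above. The function below is a standalone sketch and makes no claim about how `RequestDeduplicator.normalize_url` works internally; it lowercases the scheme and host, sorts the query parameters, and drops the fragment.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Illustrative normalizer built on urllib.parse (hypothetical helper)."""
    parts = urlsplit(url)
    # Sort query parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Passing "" as the last component removes the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )


a = normalize_url("https://Example.com/page?b=2&a=1#section")
b = normalize_url("https://example.com/page?a=1&b=2")
```

Both calls return `https://example.com/page?a=1&b=2`. Note that lowercasing the whole netloc would also lowercase any embedded credentials, so a production normalizer would treat that part more carefully.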
Real-time Monitoring
Beautiful terminal UI showing live crawl statistics and progress.
Features
- Live Progress Bar: Shows crawl completion in real-time
- Overview Stats: Total requests, success rate, speed
- Performance Metrics: Response times, rate limiting status
- Status Code Distribution: Visual breakdown of HTTP responses
- Response Time Buckets: Histogram of response time distribution
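As an illustration of the response time buckets, the sketch below groups timings into fixed ranges and renders a plain-text bar per bucket. The bucket edges, labels, and function names are assumptions for the example; the document does not state which buckets the monitor uses.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges in milliseconds.
BUCKET_EDGES_MS = [100, 250, 500, 1000, 2500]
BUCKET_LABELS = ["<100ms", "100-250ms", "250-500ms", "500ms-1s", "1-2.5s", ">=2.5s"]


def bucket_for(elapsed_ms: float) -> str:
    # bisect_right returns how many edges are <= elapsed_ms, i.e. the bucket index.
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES_MS, elapsed_ms)]


def histogram(samples_ms):
    counts = Counter(bucket_for(ms) for ms in samples_ms)
    # Keep every bucket, including empty ones, in a stable order.
    return {label: counts.get(label, 0) for label in BUCKET_LABELS}


def render(hist, width=20):
    peak = max(hist.values()) or 1  # avoid dividing by zero when all are empty
    return "\n".join(
        f"{label:>10} {'#' * round(width * n / peak)} {n}"
        for label, n in hist.items()
    )


hist = histogram([50, 100, 120, 3000])
print(render(hist))
```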
Usage
from spider_nix import SpiderNix, CrawlMonitor
monitor = CrawlMonitor(max_pages=100, show_live=True)
monitor.start()
try:
    crawler = SpiderNix()
    results = await crawler.crawl("https://example.com")
    # Update monitor with results
    for result in results:
        monitor.update(
            url=result.url,
            status_code=result.status_code,
            response_time_ms=result.metadata.get("elapsed_ms", 0),
            success=200 <= result.status_code < 300,
        )
finally:
    monitor.stop()
monitor.print_summary()
CLI Usage
# Monitoring enabled by default
spider-nix advanced-crawl https://example.com --pages 100 --monitor
Display Panels
- Overview Panel: Total requests, success rate, duplicates
- Performance Panel: Response times, backpressure, circuit state
- Status Codes Panel: Top 10 status codes with visual bars
Smart Link Prioritization
Intelligent link ordering for focused and efficient crawling.
Features
- Pattern-based Scoring: Prioritize URLs matching patterns (e.g., /api/, /docs/)
- Keyword Scoring: Boost priority for relevant keywords
- Depth Control: Prioritize shallow or deep links
- Multiple Strategies: Breadth-first, depth-first, or focused crawling
Usage
from spider_nix import LinkPrioritizer
# Create custom prioritizer
prioritizer = LinkPrioritizer(
    pattern_scores={
        r"/api/": 10.0,  # High priority for API endpoints
        r"/docs/": 8.0,  # High priority for documentation
        r"/blog/": 2.0,  # Lower priority for blog posts
    },
    keyword_scores={
        "documentation": 5.0,
        "tutorial": 4.0,
    },
    depth_penalty=0.5,  # Penalize deep links
)
# Add links
await prioritizer.add_link("https://example.com/api/v1", depth=1)
await prioritizer.add_link("https://example.com/blog/post", depth=1)
# Get next highest-priority link
link = await prioritizer.get_next_link()
print(f"Crawl next: {link.url} (priority: {-link.priority:.2f})")
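The negation in `-link.priority` above suggests that priorities are stored negated, as is usual when a min-heap is used as a max-priority queue. That is an inference, not something the document states. The sketch below shows one way such scoring and ordering could work; `TinyPrioritizer` and its scoring formula (pattern scores plus keyword scores minus a per-level depth penalty) are hypothetical.

```python
import heapq
import itertools
import re


class TinyPrioritizer:
    """Illustrative scorer plus min-heap; not SpiderNix's LinkPrioritizer."""

    def __init__(self, pattern_scores, keyword_scores, depth_penalty=0.5):
        self.pattern_scores = {re.compile(p): s for p, s in pattern_scores.items()}
        self.keyword_scores = keyword_scores
        self.depth_penalty = depth_penalty
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def score(self, url: str, depth: int) -> float:
        total = sum(s for rx, s in self.pattern_scores.items() if rx.search(url))
        total += sum(s for kw, s in self.keyword_scores.items() if kw in url.lower())
        return total - self.depth_penalty * depth

    def add_link(self, url: str, depth: int) -> None:
        # heapq is a min-heap, so push the negated score to pop the best first.
        entry = (-self.score(url, depth), next(self._counter), url)
        heapq.heappush(self._heap, entry)

    def get_next_link(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score


p = TinyPrioritizer({r"/api/": 10.0, r"/blog/": 2.0}, {"tutorial": 4.0})
p.add_link("https://example.com/blog/post", depth=1)
p.add_link("https://example.com/api/v1", depth=1)
best_url, best_score = p.get_next_link()
```

Here the API link is returned first with a score of 9.5 (10.0 for the pattern minus 0.5 for depth 1), ahead of the blog link at 1.5.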
Preset Prioritizers
from spider_nix import (
    BreadthFirstPrioritizer,  # Shallow links first
    DepthFirstPrioritizer,  # Deep links first
    FocusedCrawlPrioritizer,  # Keyword-focused
)
# Breadth-first crawling
prioritizer = BreadthFirstPrioritizer()
# Focused on API documentation
prioritizer = FocusedCrawlPrioritizer(
    focus_keywords=["api", "documentation", "reference"]
)
Configuration Presets
Pre-configured settings for common use cases.
Available Presets
| Preset | Description | Best For |
|---|---|---|
| balanced | Default balanced configuration | General purpose |
| aggressive | 50 concurrent, minimal delays | Fast scraping |
| stealth | Browser mode, long delays, low concurrency | Avoiding detection |
| fast | 30 concurrent, short timeouts | Quick scans |
| api | Optimized for API endpoints | REST API scraping |
| browser | Heavy browser usage, JavaScript | Dynamic sites |
| research | High limits, SQLite storage | Large-scale research |
Usage
from spider_nix import get_preset, SpiderNix
# Load preset
config = get_preset("stealth")
# Customize if needed
config.max_requests_per_crawl = 500
# Use with crawler
crawler = SpiderNix(config=config)
CLI Usage
# List presets
spider-nix presets
# Use preset
spider-nix advanced-crawl https://example.com --preset stealth --pages 100
Preset Details
Aggressive
max_concurrent_requests: 50
max_retries: 10
human_like_delays: False
min_delay_ms: 100
max_delay_ms: 500
Stealth
max_concurrent_requests: 3
use_browser: True
human_like_delays: True
min_delay_ms: 2000
max_delay_ms: 5000
Interactive Wizard
Guided configuration setup with interactive prompts.
Features
- Preset Selection: Start from a preset or build from scratch
- Step-by-step Customization: Configure basic, stealth, proxy, browser, and output settings
- Rich UI: Beautiful terminal interface with tables and panels
- Config Export: Save configuration to JSON file
Usage
# Run wizard
spider-nix wizard
# Follow interactive prompts:
# 1. Choose preset or start from scratch
# 2. Customize basic settings (pages, concurrency, timeouts)
# 3. Configure stealth settings
# 4. Setup proxies
# 5. Configure browser options
# 6. Choose output format
# 7. Review and save
Programmatic Usage
from spider_nix import SpiderNix, run_wizard
# Run interactive wizard
config = run_wizard()
# Use the config
crawler = SpiderNix(config=config)
HTML Report Generation
Generate beautiful HTML reports with charts and visualizations.
Features
- Interactive Charts: Status codes, response times, timeline (Chart.js)
- Summary Statistics: Success rate, avg response time, total requests
- Results Table: Detailed view of crawled URLs
- Responsive Design: Works on all screen sizes
- Professional Styling: Gradient backgrounds, hover effects
Usage
from spider_nix import generate_report, SpiderNix
# Crawl
crawler = SpiderNix()
results = await crawler.crawl("https://example.com")
# Generate report
report_path = generate_report(
    results=results,
    output_path="crawl_report.html",
    title="My Crawl Report",
)
print(f"Report saved to: {report_path}")
CLI Usage
# Generate report during crawl
spider-nix advanced-crawl https://example.com \
--pages 100 \
--report \
--report-path report.html
# Generate report from existing results
spider-nix generate-html-report results.json \
--output report.html \
--title "My Crawl Report"
Report Sections
- Summary: Overview statistics with color-coded metrics
- Status Code Distribution: Bar chart of HTTP status codes
- Response Time Distribution: Histogram of response times
- Requests Timeline: Line chart showing requests over time
- Crawl Results: Detailed table of URLs with metadata
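The numbers behind the Summary and Status Code Distribution sections are simple aggregates. The sketch below shows how they could be computed from a list of results. It assumes each result is a plain dict with `status_code` and `elapsed_ms` keys, which is a simplification for the example and not the shape of SpiderNix's result objects; `summarize` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean


def summarize(results):
    """Aggregate crawl results (assumed shape: dicts with two keys)."""
    results = list(results)
    total = len(results)
    ok = sum(1 for r in results if 200 <= r["status_code"] < 300)
    return {
        "total_requests": total,
        "success_rate": ok / total if total else 0.0,
        "avg_response_ms": mean(r["elapsed_ms"] for r in results) if total else 0.0,
        "status_codes": dict(Counter(r["status_code"] for r in results)),
    }


summary = summarize([
    {"status_code": 200, "elapsed_ms": 100.0},
    {"status_code": 404, "elapsed_ms": 300.0},
])
```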
CAPTCHA Detection
Automatically detect CAPTCHA challenges during crawling.
Supported CAPTCHAs
- Google reCAPTCHA
- hCaptcha
- FunCaptcha / Arkose Labs
- Cloudflare Challenge
- AWS WAF Captcha
- Generic CAPTCHA patterns
Usage
import httpx
from spider_nix import CaptchaDetector
detector = CaptchaDetector()
# Detect from response
async with httpx.AsyncClient() as client:
    response = await client.get("https://example.com")
    is_captcha, captcha_type = detector.detect(response=response)
    if is_captcha:
        print(f"CAPTCHA detected: {captcha_type}")
        # Handle CAPTCHA (pause, notify, solve, etc.)
Detection Methods
- Status Codes: 403, 429 with CAPTCHA indicators
- HTML Patterns: Common CAPTCHA service signatures
- Headers: CAPTCHA-related response headers
- Keywords: Challenge, verification, robot detection
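A rough idea of how status codes and HTML patterns can be combined is sketched below. The signature table is a small set of assumed markers picked for illustration, and `detect_captcha` is a hypothetical function. Neither reflects the patterns `CaptchaDetector` actually ships with, and real-world detection needs a broader, maintained list.

```python
import re

# Hypothetical signature table, checked in order from specific to generic.
SIGNATURES = {
    "recaptcha": re.compile(r"g-recaptcha|google\.com/recaptcha", re.I),
    "hcaptcha": re.compile(r"h-captcha|hcaptcha\.com", re.I),
    "cloudflare": re.compile(r"cf-chl-|challenges\.cloudflare\.com", re.I),
    "generic": re.compile(r"captcha|verify you are (a )?human", re.I),
}
SUSPICIOUS_STATUS = {403, 429}


def detect_captcha(status_code: int, html: str):
    """Return (is_captcha, captcha_type) for a response body."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(html):
            # A generic keyword alone is weak evidence, so also require
            # a suspicious status code before reporting it.
            if name == "generic" and status_code not in SUSPICIOUS_STATUS:
                continue
            return True, name
    return False, None


found = detect_captcha(200, '<div class="g-recaptcha"></div>')
clean = detect_captcha(200, "<p>An article about captcha history</p>")
```

Requiring a 403 or 429 for the generic pattern keeps an ordinary page that merely mentions the word "captcha" from being flagged.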
Session Management
Manage authenticated sessions for crawling protected content.
Features
- Login Automation: Automated login with credentials
- Cookie Management: Persist and rotate cookies
- CSRF Token Extraction: Automatically extract and use CSRF tokens
- Session Expiry: Automatic session refresh
- Multi-session Support: Manage multiple authenticated sessions
Usage
import httpx
from spider_nix import SessionManager
manager = SessionManager(
    session_ttl_minutes=60,  # Session expires after 60 min
    auto_refresh=True,  # Auto-refresh before expiry
)
# Create session with login
session = await manager.create_session(
    session_id="my_session",
    login_url="https://example.com/login",
    credentials={
        "username": "user@example.com",
        "password": "password123",
    },
)
# Apply session to HTTP client
async with httpx.AsyncClient() as client:
    manager.apply_session_to_client(client, session)
    # Now all requests use authenticated session
    response = await client.get("https://example.com/protected")
# List active sessions
sessions = manager.list_sessions()
Custom Login Handler
import httpx
# Assumption: Session is importable from spider_nix alongside SessionManager
from spider_nix import Session
async def custom_login(credentials):
    # Implement custom login logic
    async with httpx.AsyncClient() as client:
        # Step 1: Get login page
        response = await client.get("https://example.com/login")
        # Step 2: Submit login form with CSRF token
        # extract_csrf is a user-supplied helper that is not defined here
        csrf_token = extract_csrf(response.text)
        response = await client.post(
            "https://example.com/login",
            data={
                **credentials,
                "csrf_token": csrf_token,
            },
        )
        return Session(
            cookies=dict(response.cookies),
            tokens={"csrf": csrf_token},
        )
# Use custom login
session = await manager.create_session(
    session_id="custom",
    custom_login_handler=custom_login,
    credentials={"username": "user", "password": "pass"},
)
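The custom login handler calls `extract_csrf`, which this document does not define. One possible shape for such a helper, using only the standard library, is sketched below. The hidden-field names it looks for are common conventions chosen as assumptions, so adjust them to match the target site's login form.

```python
from html.parser import HTMLParser


class _CsrfParser(HTMLParser):
    """Collects the value of the first <input> whose name looks like a CSRF field."""

    def __init__(self, field_names):
        super().__init__()
        self.field_names = field_names
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attr = dict(attrs)
        if attr.get("name") in self.field_names:
            self.token = attr.get("value")


def extract_csrf(html, field_names=("csrf_token", "csrfmiddlewaretoken", "_csrf")):
    """Return the CSRF token found in a form, or None (hypothetical helper)."""
    parser = _CsrfParser(field_names)
    parser.feed(html)
    return parser.token


page = '<form><input type="hidden" name="csrf_token" value="abc123"></form>'
token = extract_csrf(page)
```

Using an HTML parser instead of a regular expression keeps the helper robust to attribute order and quoting differences in the login page markup.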
CLI Commands
New Commands
wizard
Interactive configuration wizard
spider-nix wizard
presets
List available configuration presets
spider-nix presets
advanced-crawl
Advanced crawl with all features enabled
spider-nix advanced-crawl https://example.com \
--pages 100 \
--preset stealth \
--follow \
--monitor \
--report \
--output results.json
Options:
- --preset: Use configuration preset
- --pages: Max pages to crawl
- --follow: Follow links
- --monitor: Show live monitoring (default: true)
- --report: Generate HTML report
- --report-path: Path to save HTML report
- --output: Output file path
- --format: Output format (json, csv, sqlite)
generate-html-report
Generate HTML report from existing results
spider-nix generate-html-report results.json \
--output report.html \
--title "My Crawl Report"
Updated Commands
crawl (legacy)
Standard crawl command (still available for backward compatibility)
spider-nix crawl https://example.com --pages 10
Programmatic API
Complete Example
import asyncio
from spider_nix import (
    SpiderNix,
    get_preset,
    CrawlMonitor,
    generate_report,
)
async def advanced_crawl():
    # Load preset config
    config = get_preset("balanced")
    config.max_requests_per_crawl = 100
    # Initialize crawler with all features
    crawler = SpiderNix(
        config=config,
        enable_adaptive_rate_limiting=True,
        enable_circuit_breaker=True,
        enable_deduplication=True,
    )
    # Setup monitoring
    monitor = CrawlMonitor(max_pages=100, show_live=True)
    monitor.start()
    try:
        # Crawl
        results = await crawler.crawl(
            "https://example.com",
            max_pages=100,
            follow_links=True,
        )
        # Update monitor
        for result in results:
            monitor.update(
                url=result.url,
                status_code=result.status_code,
                response_time_ms=result.metadata.get("elapsed_ms", 0),
                success=200 <= result.status_code < 300,
            )
        # Print stats
        monitor.print_summary()
        # Generate report
        report_path = generate_report(
            results=results,
            stats=monitor.stats,
            title="Advanced Crawl Report",
        )
        print(f"Report: {report_path}")
        return results
    finally:
        monitor.stop()
# Run
results = asyncio.run(advanced_crawl())
Additional Resources
- Main README - Project overview and installation
- CHANGELOG - Version history
- CONTRIBUTING - Contribution guidelines
- API Documentation - Full API reference
What's New in v0.2.0
Advanced Crawling
- Adaptive rate limiting with backpressure detection
- Circuit breaker pattern for fault tolerance
- Request deduplication (URL + content)
- Smart link prioritization with scoring
UX Improvements
- Real-time monitoring with rich UI
- HTML report generation with charts
- Interactive configuration wizard
- 7 configuration presets
Security & Authentication
- CAPTCHA detection (reCAPTCHA, hCaptcha, etc.)
- Session management for authenticated crawling
- Cookie and CSRF token handling
Developer Experience
- Enhanced CLI with new commands
- Comprehensive API documentation
- Type-safe interfaces with Pydantic
- Better error messages and logging
Performance Comparison
| Feature | v0.1.0 | v0.2.0 |
|---|---|---|
| Rate Limiting | Fixed delays | Adaptive (dynamic) |
| Deduplication | None | URL + Content |
| Monitoring | Basic console logs | Real-time rich UI |
| Reports | JSON/CSV only | HTML with charts |
| Configuration | Manual code | Presets + Wizard |
| Circuit Breaker | None | 3-state pattern |
| Link Priority | FIFO queue | Smart scoring |
License
MIT License - See LICENSE for details
Contributing
Contributions welcome! See CONTRIBUTING.md
Built with ❤️ using Python, httpx, Playwright, Rich, and Chart.js