Automated B2B Supplier Data Collection Platform with Anti-Detection & IP Rotation
A sourcing team needed to build a comprehensive supplier database across 19+ product categories and 50+ countries by collecting structured business data from B2B marketplace platforms — at scale, reliably, and without being blocked.
The Challenge
Building a large-scale supplier database from B2B platforms presented multiple technical obstacles:
- Anti-Bot Detection — Target platforms employed sophisticated bot detection including browser fingerprinting, behavioral analysis, CAPTCHA challenges, and rate limiting
- Format Inconsistency — Supplier profile layouts varied significantly across categories and regions, breaking rigid scraping templates
- IP Blocking — High-volume requests from single IPs triggered permanent bans within minutes
- Data Volume — 50,000+ supplier profiles needed across dozens of categories with 80+ fields per record
- Data Quality — Extracted data contained duplicates, incomplete records, and inconsistent formats requiring validation
- Session Management — Long-running scraping sessions degraded over time as platforms detected automated patterns
Our Solution
We built an automated B2B data collection platform with multi-layered anti-detection, VPN-based IP rotation, human behavior simulation, and structured data export — capable of reliably collecting tens of thousands of supplier records.
Architecture
- Scraping Engine: Selenium with undetected ChromeDriver for browser automation with evasion
- Anti-Detection Layer: Browser fingerprint randomization, human behavior simulation, and CAPTCHA detection
- IP Rotation: VPN manager with programmatic server switching across 12+ global locations
- Data Processing: Pydantic models for validation, pandas for transformation, multi-format export
- Configuration: YAML-based settings for categories, countries, rate limits, and anti-detection parameters
- Logging & Monitoring: Structured logging with success/failure rate tracking per session
Anti-Detection Architecture
Browser Fingerprint Evasion
The platform generates randomized browser fingerprints for each session covering:
- Screen resolution, color depth, and device pixel ratio
- Navigator properties (platform, language, hardware concurrency)
- WebGL vendor and renderer information
- Canvas and audio fingerprint noise injection
- Realistic plugin and font lists matching the spoofed platform
- Timezone consistency across all fingerprint properties
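The consistency requirement in the list above can be sketched as follows. This is a minimal illustration, not the platform's actual generator: the `FingerprintProfile` class, the `TEMPLATES` and `LOCALES` tables, and every value in them are hypothetical. The idea is to pick a platform first and then draw every other property from values that plausibly belong to that platform, so the properties never contradict each other.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FingerprintProfile:
    platform: str
    language: str
    timezone: str
    screen: tuple
    device_pixel_ratio: float
    hardware_concurrency: int
    webgl_vendor: str


# Hypothetical per-platform templates; all values are illustrative only.
TEMPLATES = {
    "Win32": {
        "screens": [(1920, 1080), (1366, 768), (2560, 1440)],
        "pixel_ratios": [1.0, 1.25, 1.5],
        "cores": [4, 8, 12, 16],
        "webgl_vendors": ["Google Inc. (NVIDIA)", "Google Inc. (Intel)"],
    },
    "MacIntel": {
        "screens": [(1440, 900), (1680, 1050), (2560, 1600)],
        "pixel_ratios": [2.0],
        "cores": [8, 10, 12],
        "webgl_vendors": ["Google Inc. (Apple)"],
    },
}

# Language and timezone are drawn as a pair so they stay consistent.
LOCALES = [
    ("en-US", "America/New_York"),
    ("en-GB", "Europe/London"),
    ("de-DE", "Europe/Berlin"),
]


def generate_profile(rng: random.Random) -> FingerprintProfile:
    """Pick a platform, then sample the remaining fields from its template."""
    platform = rng.choice(sorted(TEMPLATES))
    template = TEMPLATES[platform]
    language, timezone = rng.choice(LOCALES)
    return FingerprintProfile(
        platform=platform,
        language=language,
        timezone=timezone,
        screen=rng.choice(template["screens"]),
        device_pixel_ratio=rng.choice(template["pixel_ratios"]),
        hardware_concurrency=rng.choice(template["cores"]),
        webgl_vendor=rng.choice(template["webgl_vendors"]),
    )
```

Passing a seeded `random.Random` makes a session's profile reproducible, which helps when debugging a blocked session.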
Human Behavior Simulation
To mimic natural browsing patterns, the system implements:
- Mouse Movement — Bézier curve-based paths with realistic acceleration and deceleration
- Typing Simulation — Variable typing speeds with occasional realistic errors
- Scrolling Patterns — Multiple behavioral modes (careful reading, quick scanning, distracted browsing)
- Click Hesitation — Natural delays before interactions
- Session Fatigue — Behavior changes over long sessions to mimic human fatigue
- Break Simulation — Random pauses for extended sessions
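As an illustration of the mouse movement item above, the sketch below generates a cubic Bézier path between two points and samples it with an ease-in-out curve, so points cluster near the start and end (slow) and spread out in the middle (fast). The function names, the `spread` parameter, and the way control points are placed are assumptions made for this example; the original implementation is not shown in the case study.

```python
import random


def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow start, faster middle, slow finish."""
    return t * t * (3.0 - 2.0 * t)


def bezier_path(start, end, steps=30, spread=0.3, rng=None):
    """Return steps + 1 points along a randomized cubic Bezier curve."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0

    def control(frac):
        # Place a control point along the straight line, then push it
        # sideways (perpendicular to the line) by a random fraction.
        k = rng.uniform(-spread, spread)
        return (x0 + dx * frac - dy * k, y0 + dy * frac + dx * k)

    (x1, y1), (x2, y2) = control(1 / 3), control(2 / 3)

    points = []
    for i in range(steps + 1):
        t = ease_in_out(i / steps)
        u = 1.0 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        points.append((x, y))
    return points
```

Each returned point could then be fed to a browser automation call that moves the pointer by small offsets, with a short delay between steps.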
CAPTCHA Detection & Recovery
- Multi-type detection (reCAPTCHA, hCaptcha, Cloudflare challenges, slider CAPTCHAs)
- Confidence scoring for each detection
- Recovery strategies including IP rotation, session reset, and extended delays
- Evidence collection (screenshots and HTML) for debugging
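One way to combine multi-type detection with confidence scoring is to look for weighted markers in the page HTML, as in the sketch below. The marker strings, their weights, the thresholds, and the mapping from confidence to recovery strategy are all assumptions chosen for illustration; a real detector would be tuned against the pages it actually encounters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    kind: str
    confidence: float


# Hypothetical (marker, weight) pairs per challenge type.
SIGNATURES = {
    "recaptcha": [("g-recaptcha", 0.6), ("google.com/recaptcha", 0.4)],
    "hcaptcha": [("h-captcha", 0.6), ("hcaptcha.com", 0.4)],
    "cloudflare": [("cf-challenge", 0.5), ("checking your browser", 0.5)],
    "slider": [("slide to verify", 0.6), ("slider-captcha", 0.4)],
}


def detect(html: str) -> list:
    """Return detections sorted by descending confidence."""
    page = html.lower()
    found = []
    for kind, markers in SIGNATURES.items():
        score = sum(weight for marker, weight in markers if marker in page)
        if score > 0:
            found.append(Detection(kind, min(1.0, score)))
    return sorted(found, key=lambda d: d.confidence, reverse=True)


def choose_recovery(detection) -> str:
    """Map a detection to a recovery strategy (thresholds are assumed)."""
    if detection is None:
        return "continue"
    if detection.confidence >= 0.8:
        return "rotate_ip"
    if detection.confidence >= 0.5:
        return "reset_session"
    return "extend_delay"
```

For example, a page containing only a `g-recaptcha` element scores 0.6 here and maps to a session reset, while a page matching both reCAPTCHA markers would trigger an IP rotation.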
IP Rotation System
VPN Management
- Programmatic VPN connection management across 12+ global server locations
- Automatic connection health verification via IP checks
- Failed server blacklisting to avoid problematic locations
- Configurable rotation intervals (e.g., every N requests)
- Request counting for automatic rotation triggers
- Seamless rotation without interrupting active scraping sessions
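The rotation logic described above (request counting, health verification via IP check, and blacklisting of failed servers) might look roughly like the following. The `VpnRotator` class and its parameters are hypothetical. The `connect` and `get_ip` callables are injected because the case study does not say which VPN client or IP-check service was used.

```python
class VpnRotator:
    """Round-robin VPN rotation with request counting and blacklisting."""

    def __init__(self, servers, connect, get_ip, home_ip,
                 rotate_every=50, max_failures=2):
        self.servers = list(servers)
        self.connect = connect      # callable(server) -> bool
        self.get_ip = get_ip        # callable() -> current public IP
        self.home_ip = home_ip      # IP seen when no VPN is active
        self.rotate_every = rotate_every
        self.max_failures = max_failures
        self.failures = {s: 0 for s in self.servers}
        self.blacklist = set()
        self.active = None
        self.requests_since_rotation = 0
        self._next = 0

    def rotate(self) -> str:
        """Connect to the next healthy server, skipping blacklisted ones."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.blacklist:
                continue
            # Health check: the connection must succeed AND the public IP
            # must differ from the unprotected home IP.
            if self.connect(server) and self.get_ip() != self.home_ip:
                self.active = server
                self.requests_since_rotation = 0
                return server
            self.failures[server] += 1
            if self.failures[server] >= self.max_failures:
                self.blacklist.add(server)
        raise RuntimeError("no healthy VPN server available")

    def before_request(self) -> str:
        """Call once per request; rotates when the counter hits the limit."""
        if self.active is None or self.requests_since_rotation >= self.rotate_every:
            self.rotate()
        self.requests_since_rotation += 1
        return self.active
```

Calling `before_request()` ahead of every page fetch keeps the counting and the rotation trigger in one place, so the scraping loop does not need to know about VPN state.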
Data Extraction & Processing
Extracted Data Fields (80+)
The platform extracts comprehensive supplier information across several categories:
- Basic Info — Company name, location (country, province, city), category
- Contact Details — Email, phone, WhatsApp, website, messaging handles
- Business Metrics — Business type, years in operation, annual revenue, employee count, factory size, verification status, response rate
- Product Info — Main products, categories, MOQ, price ranges, lead times, payment terms, customization options
- Certifications — Industry certifications (ISO, quality, sustainability, safety)
- Trade Info — Export percentage, target markets, trade terms, production capacity
Data Validation & Quality
- Pydantic models enforce field types, formats, and constraints
- Email and phone number format validation
- URL normalization and verification
- Duplicate detection across email, phone, and company name
- Minimum data completeness threshold (60%+ field coverage required)
- Business type classification and normalization
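The normalization, duplicate detection, and completeness rules above can be sketched with the standard library alone. Note that this example deliberately uses plain functions rather than Pydantic models, and the five-field `FIELDS` tuple is a hypothetical stand-in for the 80+ fields of the real schema. The 60% threshold comes from the list above; everything else is an assumption.

```python
import re
from urllib.parse import urlsplit

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical reduced schema standing in for the full field list.
FIELDS = ("company_name", "country", "email", "phone", "website")


def normalize_phone(raw) -> str:
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw or "")


def normalize_url(raw) -> str:
    """Lowercase scheme and host, add a scheme if missing, drop trailing '/'."""
    raw = (raw or "").strip()
    if not raw:
        return ""
    if "://" not in raw:
        raw = "https://" + raw
    parts = urlsplit(raw)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def completeness(record) -> float:
    """Fraction of schema fields that hold a non-empty value."""
    return sum(1 for f in FIELDS if record.get(f)) / len(FIELDS)


def clean(records, min_completeness=0.6):
    """Normalize records, drop incomplete ones, and remove duplicates."""
    seen = set()
    kept = []
    for rec in records:
        rec = dict(rec)
        email = (rec.get("email") or "").strip().lower()
        rec["email"] = email if EMAIL_RE.match(email) else ""
        rec["phone"] = normalize_phone(rec.get("phone"))
        rec["website"] = normalize_url(rec.get("website"))
        if completeness(rec) < min_completeness:
            continue
        name = " ".join((rec.get("company_name") or "").lower().split())
        keys = {("email", rec["email"]), ("phone", rec["phone"]), ("name", name)}
        keys = {k for k in keys if k[1]}  # ignore empty values
        if keys & seen:  # any shared email, phone, or name is a duplicate
            continue
        seen |= keys
        kept.append(rec)
    return kept
```

A record is treated as a duplicate if it shares any one of email, phone, or normalized company name with a record already kept, which is stricter than requiring all three to match.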
Export & Organization
Data is exported in multiple formats (CSV, Excel with formatting, JSON) and organized by:
- Category — Separate datasets per product category
- Country — Separate datasets per supplier country
- Master Lists — Combined datasets with cross-category deduplication
- Summary Reports — Statistics on extraction rates, coverage, and data quality
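A per-category or per-country export can be sketched as below. This version uses only the standard library `csv` and `json` modules instead of pandas, and it omits Excel output. The `export_grouped` function, the directory layout, and the assumption that group values are safe to use as file names are all illustrative.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path


def export_grouped(records, out_dir, group_field):
    """Write one CSV and one JSON file per distinct value of group_field.

    Returns a {group_value: row_count} summary.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(group_field) or "unknown"].append(rec)

    target = Path(out_dir) / group_field
    target.mkdir(parents=True, exist_ok=True)

    summary = {}
    for key, rows in groups.items():
        # Union of all keys so rows with missing fields still fit the header.
        fieldnames = sorted({field for row in rows for field in row})
        with (target / f"{key}.csv").open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        (target / f"{key}.json").write_text(
            json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        summary[key] = len(rows)
    return summary
```

Calling the function once with `"category"` and once with `"country"` yields the two directory trees; the returned counts can feed a summary report.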
Configuration System
All behavior is controlled via YAML configuration covering:
- Category definitions with subcategories and search terms
- Target countries and priority regions
- Rate limiting (requests per minute, hour, and day)
- Anti-detection settings (rotation intervals, cookie clearing, behavioral flags)
- Extraction field requirements (required vs. optional)
- Export settings (deduplication, validation, completeness thresholds)
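To illustrate the rate limiting item above, here is a minimal sliding-window limiter that enforces several windows (for example per minute, per hour, and per day) at once. The class name, the `{window_seconds: max_requests}` format, and the injectable clock are assumptions; in the described platform these limits would presumably be read from the YAML configuration.

```python
import time
from collections import deque


class MultiWindowRateLimiter:
    """Sliding-window limiter that enforces several limits simultaneously."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps window length in seconds to max requests in that window,
        # e.g. {60: 10, 3600: 200, 86400: 2000}.
        self.limits = dict(limits)
        self.clock = clock
        self.history = deque()  # timestamps of past requests, oldest first
        self.longest = max(self.limits)

    def wait_time(self) -> float:
        """Seconds to wait before the next request satisfies every window."""
        now = self.clock()
        # Forget requests older than the longest window.
        while self.history and now - self.history[0] >= self.longest:
            self.history.popleft()
        wait = 0.0
        for window, max_requests in self.limits.items():
            recent = [t for t in self.history if now - t < window]
            if len(recent) >= max_requests:
                # The request at this index must age out of the window first.
                wait = max(wait, recent[-max_requests] + window - now)
        return wait

    def record(self) -> None:
        """Register that a request was just sent."""
        self.history.append(self.clock())
```

A scraping loop would call `time.sleep(limiter.wait_time())` before each request and `limiter.record()` after sending it. Because every timestamp within the longest window is kept, a very high daily limit would call for a more compact counter.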
Key Features
- Multi-Layer Anti-Detection — Fingerprint evasion, behavior simulation, and session management
- VPN-Based IP Rotation — 12+ global locations with automatic rotation and health checks
- 80+ Data Fields — Comprehensive supplier profiles with validated, structured data
- Human Behavior Simulation — Bézier mouse paths, variable typing, realistic scrolling patterns
- CAPTCHA Detection & Recovery — Multi-type detection with automated recovery strategies
- Multi-Format Export — CSV, Excel, and JSON with category/country organization
- Data Validation — Pydantic-enforced schemas with duplicate detection and completeness scoring
- Configurable Campaigns — YAML-driven category, country, and rate limit configuration
- Session Management — Fatigue simulation, cookie rotation, and break scheduling
- Production Shell Scripts — Pre-configured runners for different scraping profiles
Frequently Asked Questions
MicrocosmWorks implemented a multi-layered evasion system including residential proxy rotation across 50+ countries, browser fingerprint randomization using Playwright with stealth plugins, and human-like request pacing with randomized delays. The system maintains a detection rate below 2% across target sites by mimicking natural browsing patterns and rotating user agent strings.
MicrocosmWorks configured an intelligent proxy management layer that distributes requests across residential, datacenter, and mobile proxy pools based on each target site's detection sensitivity. The system tracks per-IP request counts and automatically retires IPs approaching rate limits, with a pool of over 10,000 rotating IPs ensuring continuous collection capacity.
MicrocosmWorks built a validation pipeline that checks email deliverability, phone number format and carrier, website availability, and address geocoding for every collected supplier record. Fuzzy matching on company name and address fields catches duplicate entries, and completeness scores flag records missing critical fields for re-scraping.
MicrocosmWorks implemented an automated structure monitoring system that compares page DOM structures against stored baselines on every crawl cycle. When structural changes are detected that break more than 10% of selectors, the system pauses collection for that source, alerts the operations team, and in many cases auto-repairs selectors using an LLM-based selector regeneration module.
MicrocosmWorks delivers web scraping platforms at rates of $20-$40/hr, with a full supplier data collection system including anti-detection measures, IP rotation, validation pipeline, and admin dashboard typically requiring 400-600 development hours. Ongoing proxy costs for large-scale operations typically run $500-$2,000/month depending on collection volume.