
Automated B2B Supplier Data Collection Platform with Anti-Detection & IP Rotation

A sourcing team needed to build a comprehensive supplier database across 19+ product categories and 50+ countries by collecting structured business data from B2B marketplace platforms — at scale, reliably, and without being blocked.

Domain: Web Scraping
Technologies: 12
Key Results: 6
Status: Delivered

The Challenge

Building a large-scale supplier database from B2B platforms presented multiple technical obstacles:

  • Anti-Bot Detection — Target platforms employed sophisticated bot detection including browser fingerprinting, behavioral analysis, CAPTCHA challenges, and rate limiting
  • Format Inconsistency — Supplier profile layouts varied significantly across categories and regions, breaking rigid scraping templates
  • IP Blocking — High-volume requests from single IPs triggered permanent bans within minutes
  • Data Volume — 50,000+ supplier profiles needed across dozens of categories with 80+ fields per record
  • Data Quality — Extracted data contained duplicates, incomplete records, and inconsistent formats requiring validation
  • Session Management — Long-running scraping sessions degraded over time as platforms detected automated patterns

Our Solution

We built an automated B2B data collection platform with multi-layered anti-detection, VPN-based IP rotation, human behavior simulation, and structured data export — capable of reliably collecting tens of thousands of supplier records.

Architecture

  • Scraping Engine: Selenium with undetected ChromeDriver for browser automation with evasion
  • Anti-Detection Layer: Browser fingerprint randomization, human behavior simulation, and CAPTCHA detection
  • IP Rotation: VPN manager with programmatic server switching across 12+ global locations
  • Data Processing: Pydantic models for validation, pandas for transformation, multi-format export
  • Configuration: YAML-based settings for categories, countries, rate limits, and anti-detection parameters
  • Logging & Monitoring: Structured logging with success/failure rate tracking per session

Anti-Detection Architecture

Browser Fingerprint Evasion

The platform generates randomized browser fingerprints for each session covering:

  • Screen resolution, color depth, and device pixel ratio
  • Navigator properties (platform, language, hardware concurrency)
  • WebGL vendor and renderer information
  • Canvas and audio fingerprint noise injection
  • Realistic plugin and font lists matching the spoofed platform
  • Timezone consistency across all fingerprint properties
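One way to keep the properties listed above mutually consistent is to draw them from per-platform pools rather than randomizing each field independently. The following is a minimal sketch under that assumption; the pool contents, the `Fingerprint` class, and `generate_fingerprint` are hypothetical names for illustration, not the platform's actual code.

```python
import random
from dataclasses import dataclass

# Illustrative pools (assumed values). Everything drawn for one platform
# comes from that platform's pool, so fonts and WebGL strings match it.
PLATFORM_PROFILES = {
    "Win32": {
        "resolutions": [(1920, 1080), (2560, 1440)],
        "pixel_ratios": [1.0, 1.25],
        "webgl": [("Example Vendor A", "Example Renderer A (Direct3D11)")],
        "fonts": ("Arial", "Calibri", "Segoe UI"),
    },
    "MacIntel": {
        "resolutions": [(1440, 900), (2560, 1600)],
        "pixel_ratios": [2.0],
        "webgl": [("Example Vendor B", "Example Renderer B")],
        "fonts": ("Helvetica Neue", "Menlo", "Arial"),
    },
}

# Language and timezone are paired so they never contradict each other.
LOCALES = [
    ("en-US", "America/New_York"),
    ("en-GB", "Europe/London"),
    ("de-DE", "Europe/Berlin"),
]


@dataclass(frozen=True)
class Fingerprint:
    platform: str
    screen: tuple
    color_depth: int
    device_pixel_ratio: float
    language: str
    timezone: str
    hardware_concurrency: int
    webgl_vendor: str
    webgl_renderer: str
    fonts: tuple


def generate_fingerprint(rng: random.Random) -> Fingerprint:
    """Draw one internally consistent fingerprint for a new session."""
    platform = rng.choice(sorted(PLATFORM_PROFILES))
    profile = PLATFORM_PROFILES[platform]
    language, timezone = rng.choice(LOCALES)
    vendor, renderer = rng.choice(profile["webgl"])
    return Fingerprint(
        platform=platform,
        screen=rng.choice(profile["resolutions"]),
        color_depth=24,
        device_pixel_ratio=rng.choice(profile["pixel_ratios"]),
        language=language,
        timezone=timezone,
        hardware_concurrency=rng.choice([4, 8, 12, 16]),
        webgl_vendor=vendor,
        webgl_renderer=renderer,
        fonts=profile["fonts"],
    )
```

Passing a seeded `random.Random` makes a session's fingerprint reproducible, which helps when debugging a block tied to one specific profile.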

Human Behavior Simulation

To mimic natural browsing patterns, the system implements:

  • Mouse Movement — Bézier curve-based paths with realistic acceleration and deceleration
  • Typing Simulation — Variable typing speeds with occasional realistic errors
  • Scrolling Patterns — Multiple behavioral modes (careful reading, quick scanning, distracted browsing)
  • Click Hesitation — Natural delays before interactions
  • Session Fatigue — Behavior changes over long sessions to mimic human fatigue
  • Break Simulation — Random pauses for extended sessions
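As an illustration of the mouse movement item, a cubic Bézier curve with randomized control points and an eased time parameter produces a curved path that starts slowly, speeds up, and slows down again near the target. This is a sketch only: `bezier_path`, `ease_in_out`, and the 0.3 spread factor are assumptions, and the resulting points would still need to be fed to a browser automation layer.

```python
import random


def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow start, fast middle, slow end."""
    return t * t * (3 - 2 * t)


def bezier_path(start, end, steps=30, rng=None):
    """Return steps + 1 points along a randomized cubic Bezier curve."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Offset the two control points so the path bends instead of
    # following a straight line between start and end.
    spread = max(abs(x3 - x0), abs(y3 - y0)) * 0.3
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        # Easing the curve parameter spaces points closely near both
        # ends, which reads as acceleration and deceleration.
        t = ease_in_out(i / steps)
        u = 1 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        points.append((x, y))
    return points
```

Each call with a fresh random generator yields a different curve between the same two points, so repeated movements do not trace identical paths.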

CAPTCHA Detection & Recovery

  • Multi-type detection (reCAPTCHA, hCaptcha, Cloudflare challenges, slider CAPTCHAs)
  • Confidence scoring for each detection
  • Recovery strategies including IP rotation, session reset, and extended delays
  • Evidence collection (screenshots and HTML) for debugging
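Detection with confidence scoring can be sketched as weighted marker matching over the page HTML. The marker patterns, weights, thresholds, and the mapping from confidence to recovery strategy below are all illustrative assumptions; a production detector would need signatures tuned to the target platforms.

```python
import re
from dataclasses import dataclass

# Assumed marker patterns with weights; weights per type sum to 1.0.
SIGNATURES = {
    "recaptcha": [(r"g-recaptcha", 0.6), (r"google\.com/recaptcha", 0.4)],
    "hcaptcha": [(r"h-captcha", 0.6), (r"hcaptcha\.com", 0.4)],
    "cloudflare": [(r"cf-chl", 0.5), (r"checking your browser", 0.5)],
    "slider": [(r"slide to verify", 0.6), (r"slider-captcha", 0.4)],
}


@dataclass(frozen=True)
class Detection:
    kind: str
    confidence: float


def detect_captcha(html: str, threshold: float = 0.4) -> list:
    """Return detections at or above threshold, most confident first."""
    found = []
    for kind, markers in SIGNATURES.items():
        score = sum(
            weight
            for pattern, weight in markers
            if re.search(pattern, html, re.IGNORECASE)
        )
        if score >= threshold:
            found.append(Detection(kind, min(1.0, score)))
    return sorted(found, key=lambda d: d.confidence, reverse=True)


def choose_recovery(detection: Detection) -> str:
    """Map confidence to a recovery strategy (hypothetical policy)."""
    if detection.confidence >= 0.8:
        return "rotate_ip"
    if detection.confidence >= 0.5:
        return "reset_session"
    return "extended_delay"
```

For example, a page containing both a `g-recaptcha` element and the reCAPTCHA script URL scores 1.0 for `recaptcha`, which this policy maps to `rotate_ip`.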

IP Rotation System

VPN Management

  • Programmatic VPN connection management across 12+ global server locations
  • Automatic connection health verification via IP checks
  • Failed server blacklisting to avoid problematic locations
  • Configurable rotation intervals (e.g., every N requests)
  • Request counting for automatic rotation triggers
  • Seamless rotation without interrupting active scraping sessions
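The request counting, rotation trigger, and blacklisting described above can be sketched as a small rotator class. The `connect` callable is a stand-in for whatever VPN client is in use: it is assumed to connect to a server, verify the new exit IP, and return `True` on success. `VpnRotator` and the server names are hypothetical.

```python
from typing import Callable, Iterable, Optional


class VpnRotator:
    """Round-robin VPN rotation with request counting and blacklisting."""

    def __init__(
        self,
        servers: Iterable[str],
        connect: Callable[[str], bool],
        rotate_every: int = 50,
    ) -> None:
        self.servers = list(servers)
        self.connect = connect  # assumed: connects and health-checks the IP
        self.rotate_every = rotate_every
        self.blacklist = set()
        self.current: Optional[str] = None
        self.request_count = 0
        self._next_index = 0

    def rotate(self) -> str:
        """Switch to the next healthy server, blacklisting failures."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next_index % len(self.servers)]
            self._next_index += 1
            if server in self.blacklist:
                continue
            if self.connect(server):
                self.current = server
                self.request_count = 0
                return server
            self.blacklist.add(server)
        raise RuntimeError("no healthy VPN servers available")

    def before_request(self) -> str:
        """Call before each request; rotates once the counter hits the limit."""
        if self.current is None or self.request_count >= self.rotate_every:
            self.rotate()
        self.request_count += 1
        return self.current
```

With `rotate_every=2` and one failing server, six requests would go out as two per healthy server, and the failing server would be skipped on every later rotation.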

Data Extraction & Processing

Extracted Data Fields (80+)

The platform extracts comprehensive supplier information across several categories:

  • Basic Info — Company name, location (country, province, city), category
  • Contact Details — Email, phone, WhatsApp, website, messaging handles
  • Business Metrics — Business type, years in operation, annual revenue, employee count, factory size, verification status, response rate
  • Product Info — Main products, categories, MOQ, price ranges, lead times, payment terms, customization options
  • Certifications — Industry certifications (ISO, quality, sustainability, safety)
  • Trade Info — Export percentage, target markets, trade terms, production capacity

Data Validation & Quality

  • Pydantic models enforce field types, formats, and constraints
  • Email and phone number format validation
  • URL normalization and verification
  • Duplicate detection across email, phone, and company name
  • Minimum data completeness threshold (60%+ field coverage required)
  • Business type classification and normalization
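The completeness threshold and duplicate check can be illustrated with a short filter. The platform itself uses Pydantic models for validation; the sketch below deliberately uses only the standard library so it runs without dependencies, and the regex, field names, and `clean_and_filter` function are assumptions for illustration.

```python
import re

# Deliberately loose email shape check (assumed pattern, not RFC-complete).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_phone(raw):
    """Keep digits only; reject lengths outside a plausible 7-15 range."""
    digits = re.sub(r"\D", "", raw or "")
    return digits if 7 <= len(digits) <= 15 else None


def completeness(record, fields):
    """Fraction of the expected fields that hold a non-empty value."""
    filled = sum(1 for f in fields if record.get(f) not in (None, "", []))
    return filled / len(fields)


def clean_and_filter(records, fields, min_completeness=0.6):
    """Normalize records, drop duplicates, and drop sparse records."""
    seen, kept = set(), []
    for rec in records:
        rec = dict(rec)  # avoid mutating the caller's data
        email = (rec.get("email") or "").strip().lower()
        rec["email"] = email if EMAIL_RE.match(email) else None
        rec["phone"] = normalize_phone(rec.get("phone"))
        name_key = re.sub(r"\W+", "", (rec.get("company_name") or "").lower())
        # A record is a duplicate if ANY of its keys was seen before.
        candidates = (
            ("email", rec["email"]),
            ("phone", rec["phone"]),
            ("name", name_key),
        )
        keys = {(k, v) for k, v in candidates if v}
        if keys & seen:
            continue
        if completeness(rec, fields) < min_completeness:
            continue
        seen |= keys
        kept.append(rec)
    return kept
```

Matching on any single key is a conservative choice: two records that share a phone number but list different emails are still treated as one supplier.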

Export & Organization

Data is exported in multiple formats (CSV, Excel with formatting, JSON) and organized by:

  • Category — Separate datasets per product category
  • Country — Separate datasets per supplier country
  • Master Lists — Combined datasets with cross-category deduplication
  • Summary Reports — Statistics on extraction rates, coverage, and data quality
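Splitting the master list into per-category or per-country datasets amounts to a group-by followed by one export per group. The platform uses pandas for this step; the following dependency-free sketch uses the standard `csv` module instead, and `export_by` and the column names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def export_by(records, key, columns):
    """Group records by key and return {group name: CSV text}."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(key) or "unknown"].append(rec)

    exports = {}
    for name, rows in groups.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=columns,
            extrasaction="ignore",  # skip fields not listed in columns
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
        exports[name] = buffer.getvalue()
    return exports
```

Calling `export_by(records, "category", columns)` and `export_by(records, "country", columns)` on the same records gives the two organizations described above; writing each value to a file named after its group is left to the caller.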

Configuration System

All behavior is controlled via YAML configuration covering:

  • Category definitions with subcategories and search terms
  • Target countries and priority regions
  • Rate limiting (requests per minute, hour, and day)
  • Anti-detection settings (rotation intervals, cookie clearing, behavioral flags)
  • Extraction field requirements (required vs. optional)
  • Export settings (deduplication, validation, completeness thresholds)
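To show how per-minute, per-hour, and per-day limits from such a configuration could be enforced together, here is a sliding-window sketch. The `limits` mapping (window length in seconds to maximum requests) stands in for values that would be loaded from the YAML file; the class name and the numbers in the usage note are assumptions.

```python
from collections import deque


class MultiWindowRateLimiter:
    """Enforce several sliding-window limits at once."""

    def __init__(self, limits):
        # limits: {window_seconds: max_requests}, e.g. {60: 10, 3600: 200}
        self.limits = dict(limits)
        self.events = deque()  # timestamps of past requests, oldest first

    def wait_time(self, now: float) -> float:
        """Seconds to wait at time `now` before a request is allowed."""
        horizon = max(self.limits)
        # Forget events older than the longest window.
        while self.events and now - self.events[0] >= horizon:
            self.events.popleft()

        wait = 0.0
        for window, max_requests in self.limits.items():
            recent = [t for t in self.events if now - t < window]
            if len(recent) >= max_requests:
                # The request becomes legal once the max_requests-th most
                # recent event has aged out of this window.
                wait = max(wait, recent[-max_requests] + window - now)
        return wait

    def record(self, now: float) -> None:
        """Register that a request was sent at time `now`."""
        self.events.append(now)
```

In use, the scraper would call `wait_time(time.monotonic())`, sleep for the returned duration, then call `record(...)` right before sending. Taking the timestamp as an argument keeps the class easy to test without real delays.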

Key Features

  1. Multi-Layer Anti-Detection — Fingerprint evasion, behavior simulation, and session management
  2. VPN-Based IP Rotation — 12+ global locations with automatic rotation and health checks
  3. 80+ Data Fields — Comprehensive supplier profiles with validated, structured data
  4. Human Behavior Simulation — Bézier mouse paths, variable typing, realistic scrolling patterns
  5. CAPTCHA Detection & Recovery — Multi-type detection with automated recovery strategies
  6. Multi-Format Export — CSV, Excel, and JSON with category/country organization
  7. Data Validation — Pydantic-enforced schemas with duplicate detection and completeness scoring
  8. Configurable Campaigns — YAML-driven category, country, and rate limit configuration
  9. Session Management — Fatigue simulation, cookie rotation, and break scheduling
  10. Production Shell Scripts — Pre-configured runners for different scraping profiles

Results

Scale: Collected 50,000+ supplier records across 19+ categories and 50+ countries
Data Quality: 80+ fields per supplier with 60%+ completeness rate
Detection Avoidance: 60-80% reduction in CAPTCHA encounters vs. naive scraping
Contact Rate: 70-80% email availability, 80-90% phone availability across records
Duplicate Rate: < 5% after deduplication processing
Export: Organized datasets by category and country with master aggregation

Technology Stack

Python, Selenium, Undetected ChromeDriver, BeautifulSoup, Scrapy, Playwright, Pydantic, pandas, VPN Integration, PyYAML, Loguru, YAML Configuration

Frequently Asked Questions

MicrocosmWorks implemented a multi-layered evasion system including residential proxy rotation across 50+ countries, browser fingerprint randomization using Playwright with stealth plugins, and human-like request pacing with randomized delays. The system maintains a detection rate below 2% across target sites by mimicking natural browsing patterns and rotating user agent strings.

MicrocosmWorks configured an intelligent proxy management layer that distributes requests across residential, datacenter, and mobile proxy pools based on each target site's detection sensitivity. The system tracks per-IP request counts and automatically retires IPs approaching rate limits, with a pool of over 10,000 rotating IPs ensuring continuous collection capacity.

MicrocosmWorks built a validation pipeline that verifies email deliverability, phone number format and carrier lookup, website availability, and address geocoding for every collected supplier record. Duplicate detection uses fuzzy matching on company name and address fields to prevent duplicate entries, and completeness scores flag records missing critical fields for re-scraping.

MicrocosmWorks implemented an automated structure monitoring system that compares page DOM structures against stored baselines on every crawl cycle. When structural changes are detected that break more than 10% of selectors, the system pauses collection for that source, alerts the operations team, and in many cases auto-repairs selectors using an LLM-based selector regeneration module.

MicrocosmWorks delivers web scraping platforms at rates of $20-$40/hr, with a full supplier data collection system including anti-detection measures, IP rotation, validation pipeline, and admin dashboard typically requiring 400-600 development hours. Ongoing proxy costs for large-scale operations typically run $500-$2,000/month depending on collection volume.
