Web Scraping

Automated B2B Supplier Data Collection Platform with Anti-Detection & IP Rotation

A sourcing team needed to build a comprehensive supplier database across 19+ product categories and 50+ countries by collecting structured business data from B2B marketplace platforms — at scale, reliably, and without being blocked.

Domain: Web Scraping
Technologies: 12
Key Results: 6
Status: Delivered

The Challenge

Building a large-scale supplier database from B2B platforms presented multiple technical obstacles:

  • Anti-Bot Detection — Target platforms employed sophisticated bot detection including browser fingerprinting, behavioral analysis, CAPTCHA challenges, and rate limiting
  • Format Inconsistency — Supplier profile layouts varied significantly across categories and regions, breaking rigid scraping templates
  • IP Blocking — High-volume requests from single IPs triggered permanent bans within minutes
  • Data Volume — 50,000+ supplier profiles needed across dozens of categories with 80+ fields per record
  • Data Quality — Extracted data contained duplicates, incomplete records, and inconsistent formats requiring validation
  • Session Management — Long-running scraping sessions degraded over time as platforms detected automated patterns

Our Solution

We built an automated B2B data collection platform with multi-layered anti-detection, VPN-based IP rotation, human behavior simulation, and structured data export — capable of reliably collecting tens of thousands of supplier records.

Architecture

  • Scraping Engine: Selenium with undetected ChromeDriver for browser automation with evasion
  • Anti-Detection Layer: Browser fingerprint randomization, human behavior simulation, and CAPTCHA detection
  • IP Rotation: VPN manager with programmatic server switching across 12+ global locations
  • Data Processing: Pydantic models for validation, pandas for transformation, multi-format export
  • Configuration: YAML-based settings for categories, countries, rate limits, and anti-detection parameters
  • Logging & Monitoring: Structured logging with success/failure rate tracking per session

Anti-Detection Architecture

Browser Fingerprint Evasion

For each session, the platform generates a randomized browser fingerprint covering:

  • Screen resolution, color depth, and device pixel ratio
  • Navigator properties (platform, language, hardware concurrency)
  • WebGL vendor and renderer information
  • Canvas and audio fingerprint noise injection
  • Realistic plugin and font lists matching the spoofed platform
  • Timezone consistency across all fingerprint properties
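
As a rough illustration of how the properties listed above can be kept mutually consistent, the sketch below draws every platform-dependent value from one profile and pairs language with timezone. The profile pool, the locale pairs, and the names `PROFILES`, `LOCALES`, and `generate_fingerprint` are hypothetical placeholders, not the platform's actual data or API. Canvas and audio noise injection are left out.

```python
import random
from dataclasses import dataclass

# Hypothetical profile pool (illustrative values only). Platform-dependent
# properties live together so a spoofed profile never contradicts itself.
PROFILES = [
    {
        "platform": "Win32",
        "webgl_vendor": "Google Inc. (NVIDIA)",
        "fonts": ["Arial", "Calibri", "Segoe UI"],
        "resolutions": [(1920, 1080), (2560, 1440)],
        "pixel_ratios": [1.0, 1.25],
    },
    {
        "platform": "MacIntel",
        "webgl_vendor": "Apple Inc.",
        "fonts": ["Helvetica Neue", "Menlo", "Geneva"],
        "resolutions": [(1440, 900), (2560, 1600)],
        "pixel_ratios": [2.0],
    },
]

# Language and timezone are chosen as a pair so locale and clock agree.
LOCALES = [
    ("en-US", "America/New_York"),
    ("en-GB", "Europe/London"),
    ("de-DE", "Europe/Berlin"),
]


@dataclass(frozen=True)
class Fingerprint:
    platform: str
    webgl_vendor: str
    fonts: tuple
    screen: tuple
    pixel_ratio: float
    color_depth: int
    hardware_concurrency: int
    language: str
    timezone: str


def generate_fingerprint(rng: random.Random) -> Fingerprint:
    """Pick one profile, then sample only values that fit that profile."""
    profile = rng.choice(PROFILES)
    language, timezone = rng.choice(LOCALES)
    return Fingerprint(
        platform=profile["platform"],
        webgl_vendor=profile["webgl_vendor"],
        fonts=tuple(profile["fonts"]),
        screen=rng.choice(profile["resolutions"]),
        pixel_ratio=rng.choice(profile["pixel_ratios"]),
        color_depth=24,
        hardware_concurrency=rng.choice([4, 8, 12, 16]),
        language=language,
        timezone=timezone,
    )
```

Passing a seeded `random.Random` makes a session's fingerprint reproducible, which helps when debugging a blocked session.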

Human Behavior Simulation

To mimic natural browsing patterns, the system implements:

  • Mouse Movement — Bézier curve-based paths with realistic acceleration and deceleration
  • Typing Simulation — Variable typing speeds with occasional realistic errors
  • Scrolling Patterns — Multiple behavioral modes (careful reading, quick scanning, distracted browsing)
  • Click Hesitation — Natural delays before interactions
  • Session Fatigue — Behavior changes over long sessions to mimic human fatigue
  • Break Simulation — Random pauses for extended sessions
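
The mouse-movement idea can be sketched as a cubic Bézier curve sampled with an ease-in-out schedule, so the points bunch up near the start and end of the path. The function names, the control-point placement, and the `spread` parameter are assumptions chosen for illustration. The source does not describe the exact curve construction.

```python
import random


def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow start, faster middle, slow finish."""
    return t * t * (3.0 - 2.0 * t)


def bezier_path(start, end, steps=30, spread=100.0, rng=None):
    """Return steps + 1 points along a jittered cubic Bezier from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Control points sit near 1/3 and 2/3 of the straight line, offset by
    # random jitter so that no two paths share the same shape.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        t = ease_in_out(i / steps)  # eased parameter gives accel/decel
        u = 1.0 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points
```

The resulting points could be replayed as small relative moves through a browser automation API, with a short sleep between steps.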

CAPTCHA Detection & Recovery

  • Multi-type detection (reCAPTCHA, hCaptcha, Cloudflare challenges, slider CAPTCHAs)
  • Confidence scoring for each detection
  • Recovery strategies including IP rotation, session reset, and extended delays
  • Evidence collection (screenshots and HTML) for debugging
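
One simple way to combine multi-type detection with confidence scoring is a weighted marker table checked against the page HTML. The markers, the weights, the 0.5 threshold, and the escalation order below are illustrative assumptions, not the platform's actual rules.

```python
from dataclasses import dataclass

# Hypothetical marker table: CAPTCHA type -> [(substring, weight), ...].
MARKERS = {
    "recaptcha": [("g-recaptcha", 0.6), ("google.com/recaptcha", 0.4)],
    "hcaptcha": [("h-captcha", 0.6), ("hcaptcha.com", 0.4)],
    "cloudflare": [("cf-chl", 0.5), ("checking your browser", 0.5)],
    "slider": [("slide to verify", 0.7), ("slider-captcha", 0.3)],
}

# Assumed escalation order: cheapest recovery first.
RECOVERY_ORDER = ["extended_delay", "session_reset", "ip_rotation"]


@dataclass
class Detection:
    captcha_type: str
    confidence: float


def detect_captcha(html: str, threshold: float = 0.5):
    """Return the highest-scoring detection at or above threshold, else None."""
    page = html.lower()
    best = None
    for captcha_type, markers in MARKERS.items():
        score = sum(weight for needle, weight in markers if needle in page)
        score = min(score, 1.0)
        if score >= threshold and (best is None or score > best.confidence):
            best = Detection(captcha_type, score)
    return best


def choose_recovery(attempt: int) -> str:
    """Escalate through the recovery strategies as attempts accumulate."""
    return RECOVERY_ORDER[min(attempt, len(RECOVERY_ORDER) - 1)]
```

A caller would typically save a screenshot and the raw HTML whenever `detect_captcha` returns a result, then apply `choose_recovery` with the number of consecutive detections.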

IP Rotation System

VPN Management

  • Programmatic VPN connection management across 12+ global server locations
  • Automatic connection health verification via IP checks
  • Failed server blacklisting to avoid problematic locations
  • Configurable rotation intervals (e.g., every N requests)
  • Request counting for automatic rotation triggers
  • Seamless rotation without interrupting active scraping sessions
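
The request counting, health verification, and blacklisting described above can be sketched as a small manager class. The VPN client is abstracted behind two caller-supplied callables (`connect` and `current_ip`), because the source does not name the VPN provider or its interface. The class and method names are hypothetical.

```python
from typing import Callable, Iterable, Optional


class RotationManager:
    """Rotate VPN servers every N requests and blacklist servers that fail."""

    def __init__(
        self,
        servers: Iterable[str],
        connect: Callable[[str], bool],   # assumed: True if the tunnel came up
        current_ip: Callable[[], str],    # assumed: returns the public IP
        rotate_every: int = 50,
    ):
        self.servers = list(servers)
        self.connect = connect
        self.current_ip = current_ip
        self.rotate_every = rotate_every
        self.blacklist = set()
        self.request_count = 0
        self.active: Optional[str] = None

    def rotate(self) -> str:
        """Switch to the next usable server, verifying that the IP changed."""
        previous_ip = self.current_ip()
        for server in self.servers:
            if server in self.blacklist or server == self.active:
                continue
            # Health check: the connection must succeed and the public IP
            # must differ from the one seen before switching.
            if self.connect(server) and self.current_ip() != previous_ip:
                self.active = server
                self.request_count = 0
                return server
            self.blacklist.add(server)
        raise RuntimeError("no healthy VPN server available")

    def before_request(self) -> None:
        """Call once per outgoing request; rotates when the quota is used up."""
        if self.active is None or self.request_count >= self.rotate_every:
            self.rotate()
        self.request_count += 1
```

Injecting the two callables keeps the rotation logic testable without a real VPN connection.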

Data Extraction & Processing

Extracted Data Fields (80+)

The platform extracts comprehensive supplier information across several categories:

  • Basic Info — Company name, location (country, province, city), category
  • Contact Details — Email, phone, WhatsApp, website, messaging handles
  • Business Metrics — Business type, years in operation, annual revenue, employee count, factory size, verification status, response rate
  • Product Info — Main products, categories, MOQ, price ranges, lead times, payment terms, customization options
  • Certifications — Industry certifications (ISO, quality, sustainability, safety)
  • Trade Info — Export percentage, target markets, trade terms, production capacity

Data Validation & Quality

  • Pydantic models enforce field types, formats, and constraints
  • Email and phone number format validation
  • URL normalization and verification
  • Duplicate detection across email, phone, and company name
  • Minimum data completeness threshold (60%+ field coverage required)
  • Business type classification and normalization
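
The completeness threshold and duplicate detection can be illustrated with the standard library alone. The project itself uses Pydantic models for validation, so this is a dependency-free stand-in and not the actual schema. The field names, the normalization rules, and the choice to treat a match on any one of email, phone, or company name as a duplicate are assumptions.

```python
import re


def normalize_phone(phone) -> str:
    """Keep digits only so '+86 10 1234' and '+86-10-1234' compare equal."""
    return re.sub(r"\D", "", phone or "")


def normalize_name(name) -> str:
    """Lowercase and collapse punctuation so name variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", (name or "").lower()).strip()


def completeness(record: dict, fields: list) -> float:
    """Fraction of the expected fields that hold a non-empty value."""
    filled = sum(1 for f in fields if record.get(f) not in (None, "", []))
    return filled / len(fields)


def dedupe_and_filter(records, fields, min_completeness=0.6):
    """Drop sparse records, then drop records sharing any identity key."""
    seen = set()
    kept = []
    for rec in records:
        if completeness(rec, fields) < min_completeness:
            continue
        keys = {
            ("email", (rec.get("email") or "").strip().lower()),
            ("phone", normalize_phone(rec.get("phone"))),
            ("name", normalize_name(rec.get("company_name"))),
        }
        keys = {k for k in keys if k[1]}  # ignore empty identity values
        if keys & seen:
            continue  # matches an earlier record on at least one key
        seen |= keys
        kept.append(rec)
    return kept
```

The first record seen wins. A production pipeline might instead merge duplicates so that the most complete values survive.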

Export & Organization

Data is exported in multiple formats (CSV, Excel with formatting, JSON) and organized by:

  • Category — Separate datasets per product category
  • Country — Separate datasets per supplier country
  • Master Lists — Combined datasets with cross-category deduplication
  • Summary Reports — Statistics on extraction rates, coverage, and data quality
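
Splitting records into one dataset per category or per country amounts to a group-by followed by one file write per group. The sketch below uses the standard `csv` module as a stand-in for the pandas-based export and skips Excel and JSON output. The function name, the file naming scheme, and the `"unknown"` fallback group are assumptions.

```python
import csv
from collections import defaultdict
from pathlib import Path


def export_grouped(records, out_dir, group_field, columns):
    """Write one CSV per distinct value of group_field; return {value: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(group_field) or "unknown"].append(rec)

    written = {}
    for key, rows in groups.items():
        # Build a filesystem-safe file name from the group value.
        safe = "".join(c if c.isalnum() else "_" for c in str(key)).lower()
        path = out_dir / f"{group_field}_{safe}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            # Missing columns are written as empty cells; extra keys are skipped.
            writer = csv.DictWriter(fh, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        written[key] = path
    return written
```

Calling it once with `group_field="category"` and once with `group_field="country"` yields the two per-dimension dataset families.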

Configuration System

All behavior is controlled via YAML configuration covering:

  • Category definitions with subcategories and search terms
  • Target countries and priority regions
  • Rate limiting (requests per minute, hour, and day)
  • Anti-detection settings (rotation intervals, cookie clearing, behavioral flags)
  • Extraction field requirements (required vs. optional)
  • Export settings (deduplication, validation, completeness thresholds)
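
To show how per-minute, per-hour, and per-day limits from such a configuration might be enforced together, here is a minimal sliding-window limiter. The configuration keys (`per_minute`, `per_hour`, `per_day`) and the class name are hypothetical. The real YAML schema is not shown in the source.

```python
import time
from collections import deque

# Assumed mapping from config key to window length in seconds.
WINDOWS = {"per_minute": 60, "per_hour": 3600, "per_day": 86400}


class MultiWindowRateLimiter:
    """Enforce several request limits at once over sliding time windows."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = dict(limits)   # window length in seconds -> max requests
        self.clock = clock           # injectable for testing
        self.history = deque()       # timestamps of recorded requests
        self.longest = max(self.limits)

    def wait_time(self) -> float:
        """Seconds to wait before the next request satisfies every limit."""
        now = self.clock()
        # Forget timestamps older than the longest window.
        while self.history and now - self.history[0] >= self.longest:
            self.history.popleft()
        delay = 0.0
        for window, max_requests in self.limits.items():
            recent = [t for t in self.history if now - t < window]
            if len(recent) >= max_requests:
                # Wait until enough old requests have left this window.
                oldest_blocking = recent[len(recent) - max_requests]
                delay = max(delay, oldest_blocking + window - now)
        return delay

    def record(self) -> None:
        """Register that a request was just sent."""
        self.history.append(self.clock())


def limiter_from_config(cfg: dict, clock=time.monotonic):
    """Build a limiter from a dict such as {'per_minute': 10, 'per_hour': 300}."""
    return MultiWindowRateLimiter({WINDOWS[k]: v for k, v in cfg.items()}, clock)
```

A scraping loop would sleep for `wait_time()` seconds, send the request, and then call `record()`.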

Key Features

  1. Multi-Layer Anti-Detection — Fingerprint evasion, behavior simulation, and session management
  2. VPN-Based IP Rotation — 12+ global locations with automatic rotation and health checks
  3. 80+ Data Fields — Comprehensive supplier profiles with validated, structured data
  4. Human Behavior Simulation — Bézier mouse paths, variable typing, realistic scrolling patterns
  5. CAPTCHA Detection & Recovery — Multi-type detection with automated recovery strategies
  6. Multi-Format Export — CSV, Excel, and JSON with category/country organization
  7. Data Validation — Pydantic-enforced schemas with duplicate detection and completeness scoring
  8. Configurable Campaigns — YAML-driven category, country, and rate limit configuration
  9. Session Management — Fatigue simulation, cookie rotation, and break scheduling
  10. Production Shell Scripts — Pre-configured runners for different scraping profiles

Results

Scale: Collected 50,000+ supplier records across 19+ categories and 50+ countries
Data Quality: 80+ fields per supplier, with exported records meeting the 60%+ field-coverage threshold
Detection Avoidance: 60-80% fewer CAPTCHA encounters than scraping without the anti-detection layer
Contact Rate: 70-80% email availability, 80-90% phone availability across records
Duplicate Rate: < 5% after deduplication processing
Export: Organized datasets by category and country with master aggregation

Technology Stack

Python, Selenium, Undetected ChromeDriver, BeautifulSoup, Scrapy, Playwright, Pydantic, pandas, VPN Integration, PyYAML, Loguru, YAML Configuration
