The model is optimized for client-side inference in web browsers, eliminating the need for server infrastructure while maintaining responsive prediction times.

Browser requirements

Modern browser support is essential for TensorFlow.js compatibility and optimal performance.

Minimum requirements

Requirement | Specification
Browser version | Chrome 57+, Firefox 52+, Safari 11+, Edge 79+
JavaScript | ES6 support required
WebGL | WebGL 2.0 recommended (WebGL 1.0 minimum)
Memory | 2 GB RAM minimum, 4 GB recommended
Network | Broadband connection for initial model download

Chrome / Edge

Best performance with WebGL 2.0 and optimized TensorFlow.js backend

Firefox

Good performance with full WebGL support and SIMD acceleration

Safari

Compatible on macOS and iOS with WebGL support

Mobile browsers

Supported but slower on resource-constrained devices

Model loading performance

Initial load time

The model requires a one-time download when first accessed:
  • Model size: 99.3 MB (model.json + 25 weight shards)
  • Network speed impact:
    • Broadband (10 Mbps): ~80 seconds
    • Fast connection (50 Mbps): ~16 seconds
    • Very fast (100 Mbps): ~8 seconds
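These estimates follow directly from the model size and link speed; a quick sketch of the arithmetic (the helper name is illustrative, not part of the app):
// Rough download-time estimate: size in MB × 8 bits/byte ÷ link speed in Mbps.
// Ignores TCP ramp-up, HTTP overhead, and shard parallelism, so treat it as approximate.
function estimateDownloadSeconds(modelSizeMB, linkSpeedMbps) {
    return (modelSizeMB * 8) / linkSpeedMbps;
}

console.log(estimateDownloadSeconds(99.3, 10));  // ~79 s on a 10 Mbps link
console.log(estimateDownloadSeconds(99.3, 50));  // ~16 s on a 50 Mbps link
console.log(estimateDownloadSeconds(99.3, 100)); // ~8 s on a 100 Mbps link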
Browsers automatically cache the model files, so subsequent page visits load almost instantly from cache.

Loading optimization

The model uses parallel shard loading to maximize download efficiency:
// `model` is a module-level variable; `loadingmodel` and `progressbar` are DOM
// elements used to report loading status in the UI.
async function loadModel() {
    console.log("Loading Model");

    // loadLayersModel fetches model.json, then downloads the weight shards
    // listed in its manifest in parallel
    model = await tf.loadLayersModel('cnn_model/model.json');
    console.log("Loaded Model");

    loadingmodel.innerHTML = "Loaded ML Model";
    progressbar.style.display = "none";
}
Loading sequence:
  1. Download model.json (architecture definition)
  2. Parse layer configuration and weight manifest
  3. Download 25 weight shards in parallel
  4. Reconstruct model weights from shards
  5. Initialize TensorFlow.js computation graph
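Load progress can also be surfaced to the user via the onProgress callback of tf.loadLayersModel; a minimal sketch, assuming progressbar is a <progress> element like the one referenced above:
async function loadModelWithProgress() {
    model = await tf.loadLayersModel('cnn_model/model.json', {
        // fraction goes from 0 to 1 as the weight shards are fetched
        onProgress: (fraction) => {
            progressbar.value = Math.round(fraction * 100);
        }
    });
    loadingmodel.innerHTML = "Loaded ML Model";
    progressbar.style.display = "none";
}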
Browser caching significantly improves load times for returning users:
  • First visit: Full download (~99.3 MB)
  • Subsequent visits: Cache validation only (~1-2 seconds)
  • Cache duration: Controlled by HTTP cache headers
  • Storage: Model files stored in browser HTTP cache
Recommendation: Set appropriate Cache-Control headers when serving model files:
Cache-Control: public, max-age=31536000, immutable
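As a sketch, assuming the model directory is served by a Node.js/Express static file server (a hypothetical setup, not part of the app), the header can be attached like this:
const express = require('express');
const app = express();

// Serve the model directory with long-lived, immutable caching so returning
// visitors hit the browser HTTP cache instead of re-downloading ~99 MB.
app.use('/cnn_model', express.static('cnn_model', {
    setHeaders: (res, filePath) => {
        res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
    }
}));

app.listen(8080);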

Inference performance

Prediction speed

Once loaded, the model performs real-time inference:
Hardware | Backend | Inference time
Desktop (GPU) | WebGL 2.0 | 50-150 ms
Desktop (CPU) | WASM/CPU | 200-500 ms
Mobile (high-end) | WebGL | 150-400 ms
Mobile (mid-range) | WebGL/CPU | 400-1000 ms
TensorFlow.js automatically selects the fastest available backend (WebGL, WASM, or CPU).

Inference workflow

The prediction process consists of several stages:
// `imgtag` is the <img> element holding the uploaded photo; `classes`,
// `prediction_text`, and `probability_text` are globals defined elsewhere in the app.
async function predict() {
    // 1. Image preprocessing (5-10 ms): convert to a float tensor of shape [1, 75, 100, 3]
    let tensorImg = tf.browser.fromPixels(imgtag)
                    .resizeNearestNeighbor([75, 100])
                    .toFloat().expandDims();

    // 2. Model inference (50-500 ms depending on hardware)
    model.predict(tensorImg).data().then(
        function (prediction) {
            // 3. Post-processing (1-2 ms): argmax over the class probabilities
            let predicted_class = prediction.indexOf(Math.max(...prediction));

            // 4. Display results
            prediction_text.innerHTML = classes[predicted_class];
            probability_text.innerHTML = Math.round(prediction[predicted_class] * 100) + "% Confidence";

            // Free the input tensor so it does not accumulate across predictions
            tensorImg.dispose();
        }
    );
}
Performance breakdown:
  • Preprocessing: 5-10 ms (resize and tensor conversion)
  • Model forward pass: 50-500 ms (varies by hardware)
  • Post-processing: 1-2 ms (argmax and formatting)
  • Total: 56-512 ms typical end-to-end time

Memory usage

Runtime memory footprint

  • Model weights: ~99 MB in memory
  • Activation tensors: ~15-25 MB during inference
  • Input buffer: ~0.2 MB per image
  • Total peak: ~125-140 MB

Memory optimization

TensorFlow.js handles tensor lifecycle automatically:
  • Tensor disposal: Intermediate tensors freed after computation
  • Garbage collection: WebGL textures released when out of scope
  • Memory reuse: Buffers recycled across predictions
For optimal memory usage, avoid holding references to tensors outside the prediction function; a tensor that is still referenced cannot be released.
On low-memory devices, consider these strategies:
  1. Limit concurrent predictions: Process one image at a time
  2. Manual tensor cleanup: Use tf.dispose() if needed
  3. Monitor memory: Use tf.memory() to track usage
// Check memory usage
console.log(tf.memory());
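For example, wrapping preprocessing and the forward pass in tf.tidy() releases every intermediate tensor automatically; a minimal sketch in which only the returned output tensor survives the tidy scope and is disposed explicitly:
async function predictTidy() {
    // All tensors created inside the callback are disposed when it returns,
    // except the tensor it returns.
    const output = tf.tidy(() =>
        model.predict(
            tf.browser.fromPixels(imgtag)
              .resizeNearestNeighbor([75, 100])
              .toFloat()
              .expandDims()
        )
    );

    const prediction = await output.data();
    output.dispose(); // free the output tensor once its values are copied out

    return prediction.indexOf(Math.max(...prediction));
}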

Optimization strategies

WebGL acceleration

The model leverages GPU acceleration when available: the WebGL backend provides a 3-10x speedup compared to CPU-only execution.
WebGL benefits:
  • Parallel computation of convolution operations
  • Efficient matrix multiplications in dense layers
  • Hardware-accelerated activation functions
  • Reduced memory transfers between CPU and GPU
Checking active backend:
console.log(tf.getBackend()); // 'webgl', 'wasm', or 'cpu'
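A backend can also be requested explicitly, with a fallback if it is unavailable; a minimal sketch (the preference order is illustrative):
// tf.setBackend resolves to false if the requested backend could not be initialized.
if (!(await tf.setBackend('webgl'))) {
    await tf.setBackend('cpu');
}
await tf.ready();
console.log(`Active backend: ${tf.getBackend()}`);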

Model quantization

The current model uses float32 precision:
  • Accuracy: Full precision for medical-grade predictions
  • Trade-off: Larger file size compared with a quantized model
  • Future optimization: int8 quantization (4x smaller weights) could reduce the download to ~25 MB
Quantization for medical models requires careful validation to ensure diagnostic accuracy is not compromised.

Batch inference

While the current implementation processes single images:
// Current: Single image
tensorImg.shape // [1, 75, 100, 3]
Batch processing could improve throughput for multiple images:
// Potential: Batch of 4 images
batchTensor.shape // [4, 75, 100, 3]
Benefits: ~20-30% faster per-image inference when processing multiple images
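A batched version could look like the following sketch, assuming imgElements is an array of <img> elements and model is the already-loaded network:
async function predictBatch(imgElements) {
    // Stack N preprocessed [75, 100, 3] tensors into one [N, 75, 100, 3] batch.
    const batch = tf.tidy(() =>
        tf.stack(imgElements.map((img) =>
            tf.browser.fromPixels(img)
              .resizeNearestNeighbor([75, 100])
              .toFloat()
        ))
    );

    const output = model.predict(batch);      // shape [N, numClasses]
    const classTensor = tf.argMax(output, 1); // predicted class index per image
    const classIds = await classTensor.data();

    tf.dispose([batch, output, classTensor]);
    return Array.from(classIds);
}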

Network considerations

Bandwidth optimization

Optimize model delivery with proper server configuration.
Compression:
  • Enable gzip/brotli compression for .bin files
  • Typical compression ratio: 2-3x smaller transfer size
  • Example: 99 MB → 33-50 MB over network
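As one possible setup (a sketch assuming the same hypothetical Express server as above and the compression middleware package), on-the-fly gzip can be enabled for the weight shards:
const compression = require('compression');

// Register before express.static so model responses pass through it.
// Explicitly allow .bin weight shards, which the Content-Type based default
// filter may otherwise skip for binary responses.
app.use(compression({
    filter: (req, res) =>
        req.path.endsWith('.bin') || compression.filter(req, res)
}));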
CDN usage:
  • Serve model files from CDN for global distribution
  • Reduce latency with edge caching
  • Handle traffic spikes without server load
HTTP/2:
  • Multiplexed downloads of 25 shards
  • Reduced connection overhead
  • Better parallel loading performance

Offline support

The model can be cached for offline use:
// Service Worker example
self.addEventListener('install', (event) => {
    event.waitUntil(
        caches.open('skin-cancer-model-v1').then((cache) => {
            return cache.addAll([
                'cnn_model/model.json',
                'cnn_model/group1-shard1of25.bin',
                // ... all 25 shards
            ]);
        })
    );
});
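Caching on install only fills the cache; a fetch handler is needed to actually serve the model from it. A minimal sketch:
// Serve model files cache-first, falling back to the network when absent.
self.addEventListener('fetch', (event) => {
    event.respondWith(
        caches.match(event.request).then((cached) => cached || fetch(event.request))
    );
});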
Benefits:
  • Zero network latency on repeat visits
  • Offline functionality
  • Instant predictions without internet

Performance monitoring

Tracking inference time

async function predict() {
    const startTime = performance.now();
    
    let tensorImg = tf.browser.fromPixels(imgtag)
                    .resizeNearestNeighbor([75, 100])
                    .toFloat().expandDims();
    
    const prediction = await model.predict(tensorImg).data();
    
    const endTime = performance.now();
    console.log(`Inference time: ${endTime - startTime}ms`);
    
    // Process prediction...
}

Performance metrics to track

  • Model load time: Time from page load to model ready
  • Inference latency: Time per prediction
  • Memory usage: Peak memory during inference
  • Backend type: Which TensorFlow.js backend is active
  • Frame rate: For real-time video inference scenarios
Use browser DevTools Performance tab to profile TensorFlow.js operations and identify bottlenecks.
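TensorFlow.js can also separate kernel execution time from wall-clock time; a minimal sketch using tf.time() with a dummy zero-filled input of the model's expected shape:
// Time one forward pass; dispose the prediction inside the timed callback.
const input = tf.zeros([1, 75, 100, 3]);
const timing = await tf.time(() => model.predict(input).dispose());
console.log(`Kernel: ${timing.kernelMs} ms, wall: ${timing.wallMs} ms`);
input.dispose();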

Scalability

Concurrent users

Client-side inference scales horizontally:
  • No server bottleneck: Each user runs inference locally
  • Zero backend load: Model computation happens in browser
  • Cost efficiency: No GPU server infrastructure required

Limitations

Client-side inference has inherent constraints:
  • Device capability: Performance varies widely across devices
  • Model size: 99 MB download may be prohibitive on slow connections
  • Browser compatibility: Older browsers lack WebGL support
  • Battery impact: GPU usage drains mobile battery faster
For production medical applications, consider a hybrid approach with optional server-side fallback for unsupported browsers or devices.
