The model is optimized for client-side inference in web browsers, eliminating the need for server infrastructure while maintaining responsive prediction times.

Browser requirements

Modern browser support is essential for TensorFlow.js compatibility and optimal performance.

Minimum requirements

Requirement | Specification
Browser version | Chrome 57+, Firefox 52+, Safari 11+, Edge 79+
JavaScript | ES6 support required
WebGL | WebGL 2.0 recommended (WebGL 1.0 minimum)
Memory | 2 GB RAM minimum, 4 GB recommended
Network | Broadband connection for initial model download

Chrome / Edge

Best performance with WebGL 2.0 and optimized TensorFlow.js backend

Firefox

Good performance with full WebGL support and SIMD acceleration

Safari

Compatible on macOS and iOS with WebGL support

Mobile browsers

Supported but slower on resource-constrained devices

Model loading performance

Initial load time

The model requires a one-time download when first accessed:
  • Model size: 99.3 MB (model.json + 25 weight shards)
  • Network speed impact:
    • Broadband (10 Mbps): ~80 seconds
    • Fast connection (50 Mbps): ~16 seconds
    • Very fast (100 Mbps): ~8 seconds
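These estimates follow directly from the model size and link speed; a quick sketch of the arithmetic (the helper name is illustrative, not part of the app):
// Rough download-time estimate: size in MB × 8 bits/byte ÷ link speed in Mbps.
// Ignores TCP ramp-up, HTTP overhead, and shard parallelism, so treat it as approximate.
function estimateDownloadSeconds(modelSizeMB, linkSpeedMbps) {
    return (modelSizeMB * 8) / linkSpeedMbps;
}

console.log(estimateDownloadSeconds(99.3, 10));  // ~79 s on a 10 Mbps link
console.log(estimateDownloadSeconds(99.3, 50));  // ~16 s on a 50 Mbps link
console.log(estimateDownloadSeconds(99.3, 100)); // ~8 s on a 100 Mbps link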
Browsers automatically cache the model files, so subsequent page visits load almost instantly from cache.

Loading optimization

The model uses parallel shard loading to maximize download efficiency:
// `model` is a module-level variable; `loadingmodel` and `progressbar` are DOM
// elements used to report loading status in the UI.
async function loadModel() {
    console.log("Loading Model");

    // loadLayersModel fetches model.json, then downloads the weight shards
    // listed in its manifest in parallel
    model = await tf.loadLayersModel('cnn_model/model.json');
    console.log("Loaded Model");

    loadingmodel.innerHTML = "Loaded ML Model";
    progressbar.style.display = "none";
}
Loading sequence:
  1. Download model.json (architecture definition)
  2. Parse layer configuration and weight manifest
  3. Download 25 weight shards in parallel
  4. Reconstruct model weights from shards
  5. Initialize TensorFlow.js computation graph
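Load progress can also be surfaced to the user via the onProgress callback of tf.loadLayersModel; a minimal sketch, assuming progressbar is a <progress> element like the one referenced above:
async function loadModelWithProgress() {
    model = await tf.loadLayersModel('cnn_model/model.json', {
        // fraction goes from 0 to 1 as the weight shards are fetched
        onProgress: (fraction) => {
            progressbar.value = Math.round(fraction * 100);
        }
    });
    loadingmodel.innerHTML = "Loaded ML Model";
    progressbar.style.display = "none";
}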
Browser caching significantly improves load times for returning users:
  • First visit: Full download (~99.3 MB)
  • Subsequent visits: Cache validation only (~1-2 seconds)
  • Cache duration: Controlled by HTTP cache headers
  • Storage: Model files stored in browser HTTP cache
Recommendation: Set appropriate Cache-Control headers when serving model files:
Cache-Control: public, max-age=31536000, immutable
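As a sketch, assuming the model directory is served by a Node.js/Express static file server (a hypothetical setup, not part of the app), the header can be attached like this:
const express = require('express');
const app = express();

// Serve the model directory with long-lived, immutable caching so returning
// visitors hit the browser HTTP cache instead of re-downloading ~99 MB.
app.use('/cnn_model', express.static('cnn_model', {
    setHeaders: (res, filePath) => {
        res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
    }
}));

app.listen(8080);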

Inference performance

Prediction speed

Once loaded, the model performs real-time inference:
Hardware | Backend | Inference time
Desktop (GPU) | WebGL 2.0 | 50-150 ms
Desktop (CPU) | WASM/CPU | 200-500 ms
Mobile (high-end) | WebGL | 150-400 ms
Mobile (mid-range) | WebGL/CPU | 400-1000 ms
TensorFlow.js automatically selects the fastest available backend (WebGL, WASM, or CPU).

Inference workflow

The prediction process consists of several stages:
// `imgtag` is the <img> element holding the uploaded photo; `classes`,
// `prediction_text`, and `probability_text` are globals defined elsewhere in the app.
async function predict() {
    // 1. Image preprocessing (5-10 ms): convert to a float tensor of shape [1, 75, 100, 3]
    let tensorImg = tf.browser.fromPixels(imgtag)
                    .resizeNearestNeighbor([75, 100])
                    .toFloat().expandDims();

    // 2. Model inference (50-500 ms depending on hardware)
    model.predict(tensorImg).data().then(
        function (prediction) {
            // 3. Post-processing (1-2 ms): argmax over the class probabilities
            let predicted_class = prediction.indexOf(Math.max(...prediction));

            // 4. Display results
            prediction_text.innerHTML = classes[predicted_class];
            probability_text.innerHTML = Math.round(prediction[predicted_class] * 100) + "% Confidence";

            // Free the input tensor so it does not accumulate across predictions
            tensorImg.dispose();
        }
    );
}
Performance breakdown:
  • Preprocessing: 5-10 ms (resize and tensor conversion)
  • Model forward pass: 50-500 ms (varies by hardware)
  • Post-processing: 1-2 ms (argmax and formatting)
  • Total: 56-512 ms typical end-to-end time

Memory usage

Runtime memory footprint

  • Model weights: ~99 MB in memory
  • Activation tensors: ~15-25 MB during inference
  • Input buffer: ~0.2 MB per image
  • Total peak: ~125-140 MB

Memory optimization

TensorFlow.js handles tensor lifecycle automatically:
  • Tensor disposal: Intermediate tensors freed after computation
  • Garbage collection: WebGL textures released when out of scope
  • Memory reuse: Buffers recycled across predictions
For optimal memory usage, avoid holding references to tensors outside the prediction function; a tensor that is still referenced cannot be released.
On low-memory devices, consider these strategies:
  1. Limit concurrent predictions: Process one image at a time
  2. Manual tensor cleanup: Use tf.dispose() if needed
  3. Monitor memory: Use tf.memory() to track usage
// Check memory usage
console.log(tf.memory());
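For example, wrapping preprocessing and the forward pass in tf.tidy() releases every intermediate tensor automatically; a minimal sketch in which only the returned output tensor survives the tidy scope and is disposed explicitly:
async function predictTidy() {
    // All tensors created inside the callback are disposed when it returns,
    // except the tensor it returns.
    const output = tf.tidy(() =>
        model.predict(
            tf.browser.fromPixels(imgtag)
              .resizeNearestNeighbor([75, 100])
              .toFloat()
              .expandDims()
        )
    );

    const prediction = await output.data();
    output.dispose(); // free the output tensor once its values are copied out

    return prediction.indexOf(Math.max(...prediction));
}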

Optimization strategies

WebGL acceleration

The model leverages GPU acceleration when available: the WebGL backend provides a 3-10x speedup compared to CPU-only execution.
WebGL benefits:
  • Parallel computation of convolution operations
  • Efficient matrix multiplications in dense layers
  • Hardware-accelerated activation functions
  • Reduced memory transfers between CPU and GPU
Checking active backend:
console.log(tf.getBackend()); // 'webgl', 'wasm', or 'cpu'
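A backend can also be requested explicitly, with a fallback if it is unavailable; a minimal sketch (the preference order is illustrative):
// tf.setBackend resolves to false if the requested backend could not be initialized.
if (!(await tf.setBackend('webgl'))) {
    await tf.setBackend('cpu');
}
await tf.ready();
console.log(`Active backend: ${tf.getBackend()}`);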

Model quantization

The current model uses float32 precision:
  • Accuracy: Full precision for medical-grade predictions
  • Trade-off: Larger file size compared with a quantized model
  • Future optimization: int8 quantization (4x smaller weights) could reduce the download to ~25 MB
Quantization for medical models requires careful validation to ensure diagnostic accuracy is not compromised.

Batch inference

While the current implementation processes single images:
// Current: Single image
tensorImg.shape // [1, 75, 100, 3]
Batch processing could improve throughput for multiple images:
// Potential: Batch of 4 images
batchTensor.shape // [4, 75, 100, 3]
Benefits: ~20-30% faster per-image inference when processing multiple images
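A batched version could look like the following sketch, assuming imgElements is an array of <img> elements and model is the already-loaded network:
async function predictBatch(imgElements) {
    // Stack N preprocessed [75, 100, 3] tensors into one [N, 75, 100, 3] batch.
    const batch = tf.tidy(() =>
        tf.stack(imgElements.map((img) =>
            tf.browser.fromPixels(img)
              .resizeNearestNeighbor([75, 100])
              .toFloat()
        ))
    );

    const output = model.predict(batch);      // shape [N, numClasses]
    const classTensor = tf.argMax(output, 1); // predicted class index per image
    const classIds = await classTensor.data();

    tf.dispose([batch, output, classTensor]);
    return Array.from(classIds);
}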

Network considerations

Bandwidth optimization

Optimize model delivery with proper server configuration.
Compression:
  • Enable gzip/brotli compression for .bin files
  • Typical compression ratio: 2-3x smaller transfer size
  • Example: 99 MB → 33-50 MB over network
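As one possible setup (a sketch assuming the same hypothetical Express server as above and the compression middleware package), on-the-fly gzip can be enabled for the weight shards:
const compression = require('compression');

// Register before express.static so model responses pass through it.
// Explicitly allow .bin weight shards, which the Content-Type based default
// filter may otherwise skip for binary responses.
app.use(compression({
    filter: (req, res) =>
        req.path.endsWith('.bin') || compression.filter(req, res)
}));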
CDN usage:
  • Serve model files from CDN for global distribution
  • Reduce latency with edge caching
  • Handle traffic spikes without server load
HTTP/2:
  • Multiplexed downloads of 25 shards
  • Reduced connection overhead
  • Better parallel loading performance

Offline support

The model can be cached for offline use:
// Service Worker example
self.addEventListener('install', (event) => {
    event.waitUntil(
        caches.open('skin-cancer-model-v1').then((cache) => {
            return cache.addAll([
                'cnn_model/model.json',
                'cnn_model/group1-shard1of25.bin',
                // ... all 25 shards
            ]);
        })
    );
});
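Caching on install only fills the cache; a fetch handler is needed to actually serve the model from it. A minimal sketch:
// Serve model files cache-first, falling back to the network when absent.
self.addEventListener('fetch', (event) => {
    event.respondWith(
        caches.match(event.request).then((cached) => cached || fetch(event.request))
    );
});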
Benefits:
  • Zero network latency on repeat visits
  • Offline functionality
  • Instant predictions without internet

Performance monitoring

Tracking inference time

async function predict() {
    const startTime = performance.now();
    
    let tensorImg = tf.browser.fromPixels(imgtag)
                    .resizeNearestNeighbor([75, 100])
                    .toFloat().expandDims();
    
    const prediction = await model.predict(tensorImg).data();
    
    const endTime = performance.now();
    console.log(`Inference time: ${endTime - startTime}ms`);
    
    // Process prediction...
}

Performance metrics to track

  • Model load time: Time from page load to model ready
  • Inference latency: Time per prediction
  • Memory usage: Peak memory during inference
  • Backend type: Which TensorFlow.js backend is active
  • Frame rate: For real-time video inference scenarios
Use browser DevTools Performance tab to profile TensorFlow.js operations and identify bottlenecks.
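TensorFlow.js can also separate kernel execution time from wall-clock time; a minimal sketch using tf.time() with a dummy zero-filled input of the model's expected shape:
// Time one forward pass; dispose the prediction inside the timed callback.
const input = tf.zeros([1, 75, 100, 3]);
const timing = await tf.time(() => model.predict(input).dispose());
console.log(`Kernel: ${timing.kernelMs} ms, wall: ${timing.wallMs} ms`);
input.dispose();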

Scalability

Concurrent users

Client-side inference scales horizontally:
  • No server bottleneck: Each user runs inference locally
  • Zero backend load: Model computation happens in browser
  • Cost efficiency: No GPU server infrastructure required

Limitations

Client-side inference has inherent constraints:
  • Device capability: Performance varies widely across devices
  • Model size: 99 MB download may be prohibitive on slow connections
  • Browser compatibility: Older browsers lack WebGL support
  • Battery impact: GPU usage drains mobile battery faster
For production medical applications, consider a hybrid approach with optional server-side fallback for unsupported browsers or devices.
