Apache Kafka
Overview
Apache Kafka is a distributed event streaming platform used for:

- Real-time data pipelines - Stream data between systems
- Event sourcing - Capture state changes as events
- Log aggregation - Centralize logs from multiple services
- Stream processing - Process data in real-time with Kafka Streams or Flink
Configuration
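Mage reads the destination configuration from a YAML file; since no snippet survives here, the sketch below shows the equivalent settings as a plain Python dict. Every key name is an illustrative assumption, not Mage's exact schema.

```python
# Hedged sketch of Kafka destination settings as a plain dict.
# All key names are assumptions, not Mage's exact config schema.
kafka_config = {
    "connector_type": "kafka",             # assumed destination type
    "bootstrap_server": "localhost:9092",  # broker host:port
    "topic": "user_events",                # destination topic
    "api_version": (0, 10, 2),             # Kafka API version, as a tuple
}
```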
Features
- High throughput - Millions of messages per second
- Low latency - Millisecond message delivery
- Message keys - Partition data by key for ordering
- JSON serialization - Automatic value serialization
- Automatic retries - Built-in error handling
- Multiple brokers - High availability clustering
Message Format
Mage sends messages as JSON with automatic serialization:

Message Keys
Use message keys to ensure ordering within partitions:

- Ordered processing - Messages with the same key are ordered
- Partition affinity - Related messages go to same partition
- Consumer scaling - Distribute load across consumers
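The serialization and keying behavior described above can be sketched as follows; the helper names are illustrative, not Mage's internal API.

```python
import json

# Values are serialized as JSON; keys are UTF-8 encoded strings.
# Helper names are illustrative, not Mage's internal API.
def serialize_value(record: dict) -> bytes:
    return json.dumps(record).encode("utf-8")

def serialize_key(key: str) -> bytes:
    return key.encode("utf-8")

value = serialize_value({"user_id": "u-42", "event": "login"})
key = serialize_key("u-42")
```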
API Versions
Specify Kafka API version as a tuple:

Kafka Authentication
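The variants below differ mainly in a handful of security settings. As a rough overview, they can be compared side by side; the option names follow common kafka-python producer arguments, which may not match Mage's config keys exactly, and all credentials and file paths are placeholders.

```python
# Security settings per authentication variant (kafka-python-style
# option names; all values are placeholders).
auth_variants = {
    "sasl_plain_ssl": {
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": "API_KEY",
        "sasl_plain_password": "API_SECRET",
    },
    "sasl_scram_ssl": {
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "sasl_plain_username": "username",
        "sasl_plain_password": "password",
    },
    "ssl_only": {
        "security_protocol": "SSL",
        "ssl_cafile": "ca.pem",
        "ssl_certfile": "client-cert.pem",
        "ssl_keyfile": "client-key.pem",
    },
    "no_auth_dev": {
        "security_protocol": "PLAINTEXT",
    },
}
```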
SASL/PLAIN with SSL
Common for cloud providers like Confluent Cloud:

SASL/SCRAM with SSL
More secure than PLAIN:

SSL Only
Client certificate authentication:

No Authentication (Development)
For local development:

Topic Configuration
Single Topic
Export all data to one topic:

Dynamic Topics
Route to different topics based on data (requires custom implementation):

Data Serialization
JSON (Default)
Mage automatically serializes data as JSON:

Key Serialization
Keys are UTF-8 encoded strings:

Performance Tuning
Batch Settings
Kafka producer batches messages for efficiency:

Benefits:
- Reduced network overhead
- Higher throughput
- Lower latency with compression
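The batching knobs typically involved can be sketched as follows; the option names and the 16 KB default follow kafka-python's producer arguments, which is an assumption about the underlying client.

```python
# Batch-related producer options (kafka-python names; values are examples).
batch_options = {
    "batch_size": 16384,         # max bytes buffered per partition batch
    "linger_ms": 5,              # wait up to 5 ms to fill a batch
    "compression_type": "gzip",  # compress whole batches on the wire
}
```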
Acknowledgment Levels
Control durability vs performance:

- acks=0: Maximum performance, risk of data loss
- acks=1: Good balance, leader must write before ack
- acks=all: Maximum durability, all in-sync replicas must write
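As a small illustration, the three levels map to a single producer option (the kafka-python name is acks; in raw Kafka configs, "all" may also be written as -1):

```python
# The acks setting per durability profile (kafka-python option name).
durability_profiles = {
    "max_performance": {"acks": 0},
    "balanced": {"acks": 1},
    "max_durability": {"acks": "all"},
}
```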
Partitioning Strategy
Message distribution across partitions:

With Key:
- Hash of key determines partition
- Same key always goes to same partition
- Maintains ordering per key

Without Key:
- Round-robin across partitions
- No ordering guarantee
- Better load distribution
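The keyed case can be sketched with a toy partitioner. Kafka's default partitioner actually uses a murmur2 hash; crc32 stands in here purely for illustration.

```python
import zlib

# Toy keyed partitioner: hash of the key, modulo partition count.
# Kafka's default uses murmur2; crc32 is an illustrative stand-in.
def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# Same key always maps to the same partition.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
```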
Example: Kafka Export Pipeline
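The original example block is missing here; the sketch below is a hypothetical, broker-free stand-in showing the shape of an export step. The send callback replaces a real producer's send call, and the field names are assumptions.

```python
import json

# Hypothetical export step: turn records into (key, value) byte pairs.
# `send` stands in for a real Kafka producer's send call.
def export_records(records, send):
    for record in records:
        key = str(record["user_id"]).encode("utf-8")   # key by user_id
        value = json.dumps(record).encode("utf-8")     # JSON value
        send(key, value)

sent = []
export_records(
    [{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "login"}],
    send=lambda k, v: sent.append((k, v)),
)
```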
Testing Connection
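The original snippet is missing; as a hedged stand-in, a plain TCP check can confirm the broker address is reachable at all before debugging the producer configuration itself (host and port are examples).

```python
import socket

# Quick reachability check for a broker address before deeper debugging.
def broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```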
Common Use Cases
Event Streaming
Stream user events to Kafka for real-time analytics:

Change Data Capture (CDC)
Capture database changes and stream to Kafka:

Log Aggregation
Centralize logs from multiple sources:

Error Handling
Connection Errors
Error: “NoBrokersAvailable”

Causes:
- Incorrect bootstrap server address
- Network connectivity issues
- Firewall blocking connection
- Wrong API version
Authentication Errors
Error: “AuthenticationFailure”

Solutions:
- Verify username/password
- Check security_protocol matches cluster
- Ensure SASL mechanism is correct
- Verify SSL certificates are valid
Topic Errors
Error: “UnknownTopicOrPartition”

Causes:
- Topic doesn’t exist
- No permission to create topic
- Auto-creation disabled
Message Too Large
Error: “MessageSizeTooLarge”

Solution: Increase the broker’s message.max.bytes or reduce message size

Monitoring
Producer Metrics
Monitor Kafka producer performance:

Consumer Lag
Monitor if consumers are keeping up:

Best Practices
Message Design
Include Metadata

Keep Messages Small
- Target < 1 MB per message
- Use references for large data
- Store large payloads in object storage

Plan for Schema Evolution
- Add optional fields for backward compatibility
- Include schema version in messages
- Consider using Avro or Protobuf
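The metadata and schema-version practices above can be sketched as an event envelope; all field names are illustrative assumptions.

```python
import time
import uuid

# Illustrative event envelope: metadata plus a schema version for
# backward-compatible evolution. Field names are assumptions.
def make_event(event_type: str, payload: dict) -> dict:
    return {
        "event_id": str(uuid.uuid4()),   # unique id for deduplication
        "event_type": event_type,
        "schema_version": 1,             # bump when the shape changes
        "timestamp": time.time(),
        "payload": payload,
    }

event = make_event("user.login", {"user_id": "u-42"})
```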
Partitioning Strategy
Choose Good Keys
- High cardinality (many unique values)
- Evenly distributed
- Relevant for consumers

Right Number of Partitions
- Start with 3-5 partitions
- Scale up based on throughput needs
- Consider: partitions = max(producers, consumers)
Error Handling
Idempotent Producers
Enable idempotence to prevent duplicates:

Retry Configuration

Dead Letter Queue
Route failed messages to a separate topic for investigation
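The idempotence and retry settings above correspond to standard Kafka producer configuration properties, shown here as a plain dict; the values are examples only.

```python
# Standard Kafka producer config properties for safer delivery.
reliability_config = {
    "enable.idempotence": True,  # broker deduplicates retried sends
    "retries": 5,                # retry transient send failures
    "retry.backoff.ms": 200,     # delay between retry attempts
}
```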
Integration with Other Services
Kafka Connect
Use Kafka Connect to sync Kafka topics with databases:

Stream Processing
Process Kafka messages with:

- Kafka Streams - Java/Scala stream processing
- Apache Flink - Distributed stream processing
- ksqlDB - SQL interface for Kafka
AWS Integration
For AWS Kinesis instead of Kafka, consider:

- Kinesis has a different API (not covered here)
- Similar concepts (streams, shards, records)
- Managed service with auto-scaling
Next Steps
Databases
Export to PostgreSQL, MySQL, and other databases
Data Warehouses
Configure BigQuery, Snowflake, and Redshift