2.1. Streaming Infrastructure
Real-time data pipelines rely on robust streaming infrastructure. This page outlines the building blocks of a modern streaming system, including schema design, stream architecture, emitters and sinks, and integration using Kafka Connect.
Stream Design
A stream is typically represented as a Kafka topic. Structuring your streams well ensures clarity, reusability, and performance.
Best practices:
- Use domain-based topic names, e.g. `ecommerce.orders.created`
- Prefer one event type per topic unless there’s a strong reason to combine
- Version topics if breaking schema changes are expected (e.g., `orders.v2`)
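As a hedged sketch of these conventions, the snippet below creates a domain-based topic and a versioned topic with kafka-python's admin client; the broker address, partition count, and replication factor are illustrative assumptions, not values from this page.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

admin.create_topics([
    # Domain-based name: <domain>.<entity>.<event>
    NewTopic(name="ecommerce.orders.created", num_partitions=6, replication_factor=3),
    # Versioned topic used when a breaking schema change is introduced
    NewTopic(name="orders.v2", num_partitions=6, replication_factor=3),
])
```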
Partitioning guidelines:
- Partition by IDs (`order_id`, `user_id`) for ordering guarantees. Kafka only guarantees order within a partition, so all events for the same ID must go to the same partition to preserve their sequence.

Example (using `user_id` as the key in Python):

```python
# Assumes `producer` is a kafka-python KafkaProducer and `record` is a dict
key = record["user_id"].encode("utf-8")
producer.send("user-events", key=key, value=record)
```
Example (compound key with `user_id` + `session_id`):

```python
# Combine user_id and session_id when ordering matters per session
compound_key = f"{record['user_id']}:{record['session_id']}".encode("utf-8")
producer.send("session-events", key=compound_key, value=record)
```
- Use high-cardinality keys to distribute load. High cardinality (many unique values) helps spread records across partitions evenly, preventing hot spots and enabling parallel processing.
Example (use `session_id` to spread load):

```python
# session_id has many unique values, so records spread evenly across partitions
key = record["session_id"].encode("utf-8")
producer.send("clickstream", key=key, value=record)
```
- Ensure partitions are balanced for parallelism, even after the data has already landed in Kafka. When producer-side keys are suboptimal, you can still improve distribution by:
  - Rekeying in stream processors: consume events, assign a high-cardinality key (e.g., `user_id`), and write to a new topic with better partitioning.

Example (Faust rekeying by `user_id`):

```python
import faust

app = faust.App("rekeying-app", broker="kafka://localhost:9092")

class Event(faust.Record):
    user_id: str
    event_type: str

# Consume from the poorly keyed topic and republish keyed by user_id
source = app.topic("events_by_country", value_type=Event)
target = app.topic("events_by_user", key_type=str, value_type=Event)

@app.agent(source)
async def process(events):
    async for event in events:
        await target.send(key=event.user_id, value=event)
```
  - Custom partitioner: implement logic in your producer to assign partitions manually when default hashing is insufficient (see the sketch after this list).
  - Increase partition count: more partitions allow greater consumer parallelism, especially useful when keys can’t be optimized at the source.
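Both fallback options can be sketched with kafka-python, assuming its `partitioner` producer setting (a callable that receives the serialized key plus the partition lists) and its admin client; the routing rule, topic name, and partition count below are hypothetical.

```python
import zlib

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewPartitions

def custom_partitioner(key_bytes, all_partitions, available_partitions):
    """Route a hypothetical high-volume key to its own partition, hash the rest."""
    if key_bytes == b"bulk-import":
        return all_partitions[-1]
    # crc32 is deterministic across processes, so a key always maps to the same partition
    return all_partitions[zlib.crc32(key_bytes) % (len(all_partitions) - 1)]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    partitioner=custom_partitioner,
)

# Increase the partition count of an existing topic to allow more consumer parallelism
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_partitions({"clickstream": NewPartitions(total_count=12)})
```

Note that adding partitions changes which partition future keys hash to, so per-key ordering is only guaranteed from that point onward.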
Event Schemas
Defining consistent, versioned event schemas is key to reliable and scalable stream processing.
Why schemas matter:
- Enforce data contracts between producers and consumers
- Validate structure before sending/processing events
- Enable safe schema evolution
- Power downstream automation (e.g., code generation, analytics models)
Formats
Avro is the most commonly used format for Kafka events due to its balance of compactness and compatibility features.
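As a minimal sketch (assuming the fastavro library and a hypothetical order-created schema, neither of which this page prescribes), the example below validates a record against an Avro schema and encodes it to compact binary before producing it to Kafka:

```python
from io import BytesIO

from fastavro import parse_schema, schemaless_writer
from fastavro.validation import validate
from kafka import KafkaProducer

# Hypothetical v1 schema for an order-created event
order_created_v1 = parse_schema({
    "type": "record",
    "name": "OrderCreated",
    "namespace": "ecommerce.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

producer = KafkaProducer(bootstrap_servers="localhost:9092")
record = {"order_id": "o-123", "user_id": "u-42", "amount": 19.99}

# Enforce the data contract before the event leaves the producer
validate(record, order_created_v1, raise_errors=True)

# Schemaless (binary-only) Avro encoding: compact, no per-message field names
buf = BytesIO()
schemaless_writer(buf, order_created_v1, record)
producer.send(
    "ecommerce.orders.created",
    key=record["order_id"].encode("utf-8"),
    value=buf.getvalue(),
)
```

Consumers need the same schema to decode these payloads, which is why the versioning strategy below matters.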
Versioning Strategy
Types of changes:
Non-breaking changes (schema evolution allowed on the same topic):
- Add optional fields with defaults
- Add new fields with null union types
- Change logical types (e.g., add timestamp-millis)

Breaking changes (require a new topic version):
- Remove or rename fields
- Change required field types
- Modify enum values incompatibly
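For example, a non-breaking evolution of a hypothetical OrderCreated record might add an optional field as a null union with a default, which existing consumers can safely ignore (the record and field names here are illustrative):

```json
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "ecommerce.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "coupon_code", "type": ["null", "string"], "default": null}
  ]
}
```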
Best practices:
- Use schema evolution for safe, additive changes
- For breaking changes, publish to a new topic (e.g., `orders.v2`)
- Version schema files and record names to make changes explicit:
```text
schemas/
  orders/
    order_created.v1.avsc
    order_created.v2.avsc
```

```json
{
  "type": "record",
  "name": "OrderCreatedV2",
  "namespace": "ecommerce.orders.v2",
  // remaining fields omitted
}
```
This versioning convention allows producers and consumers to gradually migrate, while preserving backward compatibility where possible.
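For instance, during a migration window a consumer can subscribe to both the old and the new topic at once. This is a hedged sketch with kafka-python; the v1 topic name, group id, and broker address are assumptions for illustration.

```python
from kafka import KafkaConsumer

# Read the legacy topic and the versioned topic side by side during migration
consumer = KafkaConsumer(
    "orders",      # hypothetical v1 topic
    "orders.v2",   # versioned topic for the breaking change
    bootstrap_servers="localhost:9092",
    group_id="order-analytics",
)

for message in consumer:
    # Decode per topic so each schema version is handled explicitly
    schema_version = "v2" if message.topic.endswith(".v2") else "v1"
    print(f"{schema_version} event from {message.topic}: {len(message.value)} bytes")
```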