Overview
Streamkap uses UTF-8 as its internal encoding for all data transiting the pipeline. Text data flowing through Streamkap is expected to be UTF-8 compatible. Binary data (such as BLOBs and raw byte columns) is handled separately and preserved in its original form or encoded as base64, depending on the source configuration and destination requirements.

Key points:

- All text fields are serialized as UTF-8 in Kafka
- Binary columns (`BLOB`, `BYTEA`, `VARBINARY`, `RAW`) are configurable via the **Represent binary data as** source setting
- Non-UTF-8 text data is not supported and may result in data loss or errors
How Encoding Works in the Pipeline
Data flows through the following stages:

| Data type | How it is handled |
|---|---|
| Text fields (`VARCHAR`, `TEXT`, `NVARCHAR`, `CHAR`) | Decoded and serialized as UTF-8 strings |
| Binary fields (`BLOB`, `BYTEA`, `VARBINARY`, `RAW`) | Preserved as binary; representation controlled by the **Represent binary data as** source connector setting (default: `bytes`) |
| JSON fields | Serialized as UTF-8 JSON strings |
| Decimal/Numeric fields | Encoded as bytes with schema metadata or as numeric strings, depending on connector configuration |
Binary Data Representation
All source connectors that handle binary columns expose a **Represent binary data as** setting. This controls how binary column data (e.g., `BLOB`, `BINARY`, `VARBINARY`) is interpreted. Your destination for this data can impact which option you choose. The default is `bytes`.
The binary data representation setting is configured per source connector. See your specific source connector setup page for details.
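To see what each representation looks like before committing to one, you can experiment directly in the database. A small sketch using PostgreSQL's built-in `encode`/`decode` functions (the byte value is arbitrary):

```sql
-- Preview how a raw BYTEA value maps to base64 text
SELECT encode('\xDEADBEEF'::bytea, 'base64') AS base64_form;  -- '3q2+7w=='

-- And the reverse mapping, back to raw bytes
SELECT decode('3q2+7w==', 'base64') AS raw_bytes;             -- '\xdeadbeef'
```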
Database-Specific Encoding Notes
PostgreSQL
- PostgreSQL supports UTF-8 natively when the database is created with `UTF8` encoding
- Verify your database encoding: `SHOW server_encoding;` (expanded in the sketch below)
- The `client_encoding` should be set to `UTF8` for the Streamkap connection
- Binary data types (`BYTEA`) are handled via the **Represent binary data as** setting
- Arrays, JSON/JSONB, and hstore types are serialized as UTF-8 strings
- Unsupported: Non-UTF-8 database encodings (e.g., `LATIN1`, `SQL_ASCII`)
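A quick sanity check, assuming you can run queries against the source database (the database name `mydb` below is hypothetical):

```sql
-- Encoding of the current database (server side)
SHOW server_encoding;    -- expect: UTF8

-- Encoding used by the current client connection
SHOW client_encoding;    -- expect: UTF8

-- Encodings of every database on the server
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database;

-- Creating a new database with an explicit UTF-8 encoding
CREATE DATABASE mydb ENCODING 'UTF8' TEMPLATE template0;
```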
Common Encoding Issues
Garbled characters at the destination
Symptoms: Text appears as question marks, mojibake, or unexpected characters at the destination.

Common causes:
- The source database is not using UTF-8 encoding
- A character set mismatch between the database server and client connection
- Legacy encoding (e.g., `latin1` in MySQL or `SQL_ASCII` in PostgreSQL) storing non-ASCII data
Resolution:

- Verify the source database encoding (see database-specific notes above)
- Convert the database or affected tables/columns to UTF-8 (`utf8mb4` for MySQL, `UTF8` for PostgreSQL)
- Ensure the database client connection is set to UTF-8 (see the check below)
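For a MySQL source, a common culprit is a mismatch between the server and client character-set settings. A quick way to inspect them, using standard MySQL system variables:

```sql
-- Show all character-set-related settings for the server and current session.
-- character_set_server, character_set_database, character_set_client,
-- character_set_connection, and character_set_results should all be utf8mb4.
SHOW VARIABLES LIKE 'character_set%';

-- Collations should match the chosen character set (e.g., utf8mb4_unicode_ci)
SHOW VARIABLES LIKE 'collation%';
```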
Missing characters in text fields
Symptoms: Some characters are silently dropped or replaced during transit.

Common causes:
- Non-UTF-8 characters in the source data that cannot be decoded
- MySQL `utf8` charset (3-byte) used instead of `utf8mb4` (4-byte), truncating 4-byte characters
- Source data contains invalid byte sequences
Resolution:

- For MySQL, switch to `utf8mb4`: `ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;` (see the conversion sketch below for existing tables)
- Identify and clean invalid byte sequences in the source data
- Ensure the source database uses a full UTF-8 encoding
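Note that `ALTER DATABASE ... CHARACTER SET` only changes the default for newly created tables; existing tables keep their old character set. A fuller conversion sketch, assuming a database named `mydb` and a table named `mytable` (both hypothetical):

```sql
-- Audit: find tables that are not yet on utf8mb4
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'mydb'
  AND table_collation NOT LIKE 'utf8mb4%';

-- Convert an existing table in place
-- (this rewrites the table; plan for locking on large tables)
ALTER TABLE mydb.mytable
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```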
Binary data appears as base64 strings
Symptoms: `BLOB`, `BYTEA`, or `VARBINARY` columns appear as long base64-encoded strings at the destination instead of raw binary.

Cause: This is expected behavior. Binary columns are serialized for transport through Kafka, and text-based serialization formats represent raw bytes as base64. The **Represent binary data as** setting on the source connector controls the format.

Resolution:

- If your destination needs raw binary, check whether it supports automatic base64 decoding (see the sketch below)
- Review the **Represent binary data as** setting on your source connector (default: `bytes`)
- For text-based destinations (JSON, CSV), base64 encoding is the standard approach for binary data
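If the destination is itself a database with base64 helpers, you can decode on read rather than changing the pipeline. A sketch assuming a PostgreSQL destination; the table `staging_files` and column `payload_b64` are hypothetical names, while `decode` is a PostgreSQL built-in:

```sql
-- Convert a base64-encoded text column back to raw bytes (bytea) at query time
SELECT id,
       decode(payload_b64, 'base64') AS payload_raw
FROM staging_files;
```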
Special characters (emoji, CJK) not rendering
Symptoms: Emoji, Chinese/Japanese/Korean characters, or other multi-byte Unicode characters are missing or corrupted.

Common causes:
- MySQL using `utf8` instead of `utf8mb4` (only supports the Basic Multilingual Plane, up to 3 bytes)
- Destination column defined with insufficient character set support
- Intermediate systems stripping 4-byte UTF-8 sequences
Resolution:

- For MySQL sources, ensure the character set is `utf8mb4` at the server, database, and table levels
- Verify the destination table/column supports full UTF-8 (4-byte)
- Test with a known multi-byte string (e.g., an emoji) before going to production, as sketched below
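A minimal round-trip smoke test, assuming a MySQL source; the table `encoding_test` and its contents are hypothetical:

```sql
-- Create a throwaway table with an explicit 4-byte-capable character set
CREATE TABLE encoding_test (
  id  INT PRIMARY KEY,
  val VARCHAR(100)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Insert emoji (4-byte), accented (2-byte), and CJK (3-byte) characters
INSERT INTO encoding_test VALUES (1, '😀 café 漢字');

-- After the pipeline syncs, run the same query at the destination and compare.
-- Under utf8mb4, '😀' alone reports 1 character and 4 bytes.
SELECT val, CHAR_LENGTH(val) AS chars, LENGTH(val) AS bytes
FROM encoding_test;
```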
Best Practices
- Use UTF-8 encoding at the source database level: this is the single most important step for preventing encoding issues
- For MySQL, always use `utf8mb4`: MySQL's `utf8` only supports 3-byte characters and silently truncates 4-byte characters (emoji, some CJK characters, mathematical symbols)
- Test with special characters before production: insert rows containing emoji, accented characters, and CJK text, then verify they arrive correctly at the destination
- For binary data, choose the right representation: review the **Represent binary data as** setting on your source connector and ensure your destination can handle the chosen format
- Monitor the Dead Letter Queue (DLQ) for encoding errors: encoding-related failures appear in the Dead Letter Queue; check error headers for conversion or charset messages
Destination Encoding Configuration
Most destinations receive data as UTF-8 by default and do not require additional encoding configuration.

Redis
The Redis (Generic) destination includes a configurable **Character Encoding** parameter (default: `UTF-8`). This setting controls the character set used for encoding string values written to Redis. Adjust this only if your Redis consumers expect a different encoding.
Other Destinations
All other destinations use UTF-8 encoding by default. No additional encoding configuration is required.

Known Limitations
- Non-UTF-8 error handling: The exact behavior when non-UTF-8 data is encountered (silent replacement, error, or routing to the DLQ) depends on the source connector and database. Consult your specific source connector documentation.
- Encoding conversion as a transform: Encoding conversion is not currently available as a built-in transform option. Data must be UTF-8 compatible at the source.
- Maximum string length: String length limits vary by destination. Check your destination connector documentation for column size constraints and the DLQ for size-related errors.