
Overview

Streamkap uses UTF-8 as its internal encoding for all data in transit through the pipeline. Text data flowing through Streamkap is expected to be UTF-8 compatible. Binary data (such as BLOBs and raw byte columns) is handled separately and preserved in its original form or encoded as base64, depending on the source configuration and destination requirements. Key points:
  • All text fields are serialized as UTF-8 in Kafka
  • Binary columns (BLOB, BYTEA, VARBINARY, RAW) are configurable via the Represent binary data as source setting
  • Non-UTF-8 text data is not supported and may result in data loss or errors

How Encoding Works in the Pipeline

Data flows through the following stages:
Source Database --> Kafka (UTF-8 serialization) --> Destination
Data type | How it is handled
Text fields (VARCHAR, TEXT, NVARCHAR, CHAR) | Decoded and serialized as UTF-8 strings
Binary fields (BLOB, BYTEA, VARBINARY, RAW) | Preserved as binary; representation controlled by the Represent binary data as source connector setting (default: bytes)
JSON fields | Serialized as UTF-8 JSON strings
Decimal/Numeric fields | Encoded as bytes with schema metadata or as numeric strings, depending on connector configuration
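As a rough sketch of the table above (not Streamkap's actual serialization code; the field names are invented), here is how a captured row might be prepared for UTF-8/JSON transport:

```python
import base64
import json

# Hypothetical captured row; column names are illustrative only.
row = {
    "name": "café",              # text column: travels as a UTF-8 string
    "payload": b"\x00\xffBLOB",  # binary column: not valid UTF-8 on its own
    "price": "19.99",            # decimal represented here as a numeric string
}

# Text and numeric-string values pass through unchanged; binary values are
# base64-encoded so the record survives text-based (JSON) serialization.
record = {
    "name": row["name"],
    "payload": base64.b64encode(row["payload"]).decode("ascii"),
    "price": row["price"],
}

encoded = json.dumps(record).encode("utf-8")  # the bytes that actually travel
```

Decoding `encoded` back with UTF-8 and base64 recovers the original values exactly, which is the property the pipeline relies on.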

Binary Data Representation

All source connectors that handle binary columns expose a Represent binary data as setting, which controls how binary column data (e.g., BLOB, BINARY, VARBINARY) is represented in the pipeline. The destination you are writing to may influence which option you choose. The default is bytes.
The binary data representation setting is configured per source connector. See your specific source connector setup page for details.
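To make the choice concrete, this Python sketch shows a few common text-safe representations of the same binary value. The docs above only guarantee a bytes default, so treat the other representation names as assumptions to verify against your connector's setup page:

```python
import base64

blob = b"\xde\xad\xbe\xef"  # sample binary column value

# Common representations of the same value; whether your connector offers
# anything beyond the default "bytes" is an assumption to verify.
representations = {
    "bytes": blob,                                     # raw bytes
    "base64": base64.b64encode(blob).decode("ascii"),  # text-safe string
    "hex": blob.hex(),                                 # another text form
}
```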

Database-Specific Encoding Notes

  • PostgreSQL supports UTF-8 natively when the database is created with UTF8 encoding
  • Verify your database encoding: SHOW server_encoding;
  • The client_encoding should be set to UTF8 for the Streamkap connection
  • Binary data types (BYTEA) are handled via the Represent binary data as setting
  • Arrays, JSON/JSONB, and hstore types are serialized as UTF-8 strings
  • Unsupported: Non-UTF-8 database encodings (e.g., LATIN1, SQL_ASCII)
If your PostgreSQL database uses a non-UTF-8 encoding (such as SQL_ASCII or LATIN1), character data may not be captured correctly. Convert your database to UTF-8 encoding before configuring CDC.
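The failure mode is easy to reproduce outside the database. In this Python sketch, a LATIN1-encoded value (the kind a SQL_ASCII database will happily store) is not valid UTF-8, so a strict decode fails and a lenient decode loses data:

```python
# The byte 0xE9 is "é" in LATIN1 but an invalid sequence start in UTF-8.
stored = "résumé".encode("latin-1")  # b'r\xe9sum\xe9'

try:
    stored.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8 compatible:", exc)

# A lenient decode replaces the bytes instead of failing, i.e. data loss:
print(stored.decode("utf-8", errors="replace"))  # r�sum�
```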

Common Encoding Issues

Symptoms: Text appears as question marks, mojibake, or unexpected characters at the destination.
Common causes:
  • The source database is not using UTF-8 encoding
  • A character set mismatch between the database server and client connection
  • Legacy encoding (e.g., latin1 in MySQL or SQL_ASCII in PostgreSQL) storing non-ASCII data
Resolution:
  1. Verify the source database encoding (see database-specific notes above)
  2. Convert the database or affected tables/columns to UTF-8 (utf8mb4 for MySQL, UTF8 for PostgreSQL)
  3. Ensure the database client connection is set to UTF-8
Symptoms: Some characters are silently dropped or replaced during transit.
Common causes:
  • Non-UTF-8 characters in the source data that cannot be decoded
  • MySQL utf8 charset (3-byte) used instead of utf8mb4 (4-byte), truncating 4-byte characters
  • Source data contains invalid byte sequences
Resolution:
  1. For MySQL, switch to utf8mb4: ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  2. Identify and clean invalid byte sequences in the source data
  3. Ensure the source database uses a full UTF-8 encoding
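The utf8-versus-utf8mb4 distinction comes down to bytes per character, which is easy to check in Python:

```python
emoji = "😀"  # U+1F600, outside the Basic Multilingual Plane
assert len(emoji.encode("utf-8")) == 4  # needs 4 bytes: utf8mb4 territory

cjk = "漢"  # U+6F22, inside the BMP
assert len(cjk.encode("utf-8")) == 3  # fits MySQL's legacy 3-byte utf8
```

MySQL's legacy utf8 (utf8mb3) caps characters at 3 bytes, so the emoji above would be rejected or truncated there.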
Symptoms: BLOB, BYTEA, or VARBINARY columns appear as long base64-encoded strings at the destination instead of raw binary.
Cause: This is expected behavior. Binary columns must be serialized for transport, and with text-based serialization formats (such as JSON) raw bytes are represented as base64. The Represent binary data as setting on the source connector controls the format.
Resolution:
  • If your destination needs raw binary, check whether it supports automatic base64 decoding
  • Review the Represent binary data as setting on your source connector (default: bytes)
  • For text-based destinations (JSON, CSV), base64 encoding is the standard approach for binary data
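If your destination does need the raw bytes, decoding the base64 string once on arrival recovers them exactly; the value below is illustrative:

```python
import base64

arrived = "3q2+7w=="  # binary column value as received at a text destination

raw = base64.b64decode(arrived)
assert raw == b"\xde\xad\xbe\xef"
```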
Symptoms: Emoji, Chinese/Japanese/Korean characters, or other multi-byte Unicode characters are missing or corrupted.
Common causes:
  • MySQL using utf8 instead of utf8mb4 (only supports Basic Multilingual Plane, up to 3 bytes)
  • Destination column defined with insufficient character set support
  • Intermediate systems stripping 4-byte UTF-8 sequences
Resolution:
  1. For MySQL sources, ensure the character set is utf8mb4 at the server, database, and table levels
  2. Verify the destination table/column supports full UTF-8 (4-byte)
  3. Test with a known multi-byte string (e.g., an emoji) before going to production
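For step 3, a small pre-production check can flag values that require full 4-byte UTF-8 support; the helper name here is invented:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character falls outside the BMP (needs 4 UTF-8 bytes)."""
    return any(ord(ch) > 0xFFFF for ch in text)

assert needs_utf8mb4("launch 🚀")
assert not needs_utf8mb4("café 東京")  # accented and CJK text fit in 3 bytes
```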

Best Practices

  1. Use UTF-8 encoding at the source database level — this is the single most important step for preventing encoding issues
  2. For MySQL, always use utf8mb4 — MySQL’s utf8 only supports 3-byte characters and silently truncates 4-byte characters (emoji, some CJK characters, mathematical symbols)
  3. Test with special characters before production — insert rows containing emoji, accented characters, and CJK text, then verify they arrive correctly at the destination
  4. For binary data, choose the right representation — review the Represent binary data as setting on your source connector and ensure your destination can handle the chosen format
  5. Monitor the Dead Letter Queue (DLQ) for encoding errors — encoding-related failures appear in the Dead Letter Queue; check error headers for conversion or charset messages

Destination Encoding Configuration

Most destinations receive data as UTF-8 by default and do not require additional encoding configuration.

Redis

The Redis (Generic) destination includes a configurable Character Encoding parameter (default: UTF-8). This setting controls the character set used for encoding string values written to Redis. Adjust this only if your Redis consumers expect a different encoding.

Other Destinations

All other destinations use UTF-8 encoding by default. No additional encoding configuration is required.

Known Limitations

  • Non-UTF-8 error handling: The exact behavior when non-UTF-8 data is encountered (silent replacement, error, or routing to the DLQ) depends on the source connector and database. Consult your specific source connector documentation.
  • Encoding conversion as a transform: Encoding conversion is not currently available as a built-in transform option. Data must be UTF-8 compatible at the source.
  • Maximum string length: String length limits vary by destination. Check your destination connector documentation for column size constraints and the DLQ for size-related errors.