
Overview

Streamkap uses UTF-8 as its internal encoding for all data in transit through the pipeline. Text data flowing through Streamkap is expected to be UTF-8 compatible. Binary data (such as BLOBs and raw byte columns) is handled separately and preserved in its original form or encoded as base64, depending on the source configuration and destination requirements. Key points:
  • All text fields are serialized as UTF-8 in Kafka
  • Binary columns (BLOB, BYTEA, VARBINARY, RAW) are configurable via the Represent binary data as source setting
  • Non-UTF-8 text data is not supported and may result in data loss or errors

How Encoding Works in the Pipeline

Data flows through the following stages:
Source Database --> Kafka (UTF-8 serialization) --> Destination
Data type | How it is handled
Text fields (VARCHAR, TEXT, NVARCHAR, CHAR) | Decoded and serialized as UTF-8 strings
Binary fields (BLOB, BYTEA, VARBINARY, RAW) | Preserved as binary; representation controlled by the Represent binary data as source connector setting (default: bytes)
JSON fields | Serialized as UTF-8 JSON strings
Decimal/Numeric fields | Encoded as bytes with schema metadata or as numeric strings, depending on connector configuration
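As a rough sketch of the table above (not Streamkap's actual serialization code; the field names are invented), here is how a captured row might be prepared for UTF-8/JSON transport:

```python
import base64
import json

# Hypothetical captured row; column names are illustrative only.
row = {
    "name": "café",              # text column: travels as a UTF-8 string
    "payload": b"\x00\xffBLOB",  # binary column: not valid UTF-8 on its own
    "price": "19.99",            # decimal represented here as a numeric string
}

# Text and numeric-string values pass through unchanged; binary values are
# base64-encoded so the record survives text-based (JSON) serialization.
record = {
    "name": row["name"],
    "payload": base64.b64encode(row["payload"]).decode("ascii"),
    "price": row["price"],
}

encoded = json.dumps(record).encode("utf-8")  # the bytes that actually travel
```

Decoding `encoded` back with UTF-8 and base64 recovers the original values exactly, which is the property the pipeline relies on.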

Binary Data Representation

All source connectors that handle binary columns expose a Represent binary data as setting, which controls how binary column data (e.g., BLOB, BINARY, VARBINARY) is represented in the pipeline. The destination you are writing to may influence which option you choose. The default is bytes.
The binary data representation setting is configured per source connector. See your specific source connector setup page for details.
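To make the choice concrete, this Python sketch shows a few common text-safe representations of the same binary value. The docs above only guarantee a bytes default, so treat the other representation names as assumptions to verify against your connector's setup page:

```python
import base64

blob = b"\xde\xad\xbe\xef"  # sample binary column value

# Common representations of the same value; whether your connector offers
# anything beyond the default "bytes" is an assumption to verify.
representations = {
    "bytes": blob,                                     # raw bytes
    "base64": base64.b64encode(blob).decode("ascii"),  # text-safe string
    "hex": blob.hex(),                                 # another text form
}
```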

Database-Specific Encoding Notes

  • PostgreSQL supports UTF-8 natively when the database is created with UTF8 encoding
  • Verify your database encoding: SHOW server_encoding;
  • The client_encoding should be set to UTF8 for the Streamkap connection
  • Binary data types (BYTEA) are handled via the Represent binary data as setting
  • Arrays, JSON/JSONB, and hstore types are serialized as UTF-8 strings
  • Unsupported: Non-UTF-8 database encodings (e.g., LATIN1, SQL_ASCII)
If your PostgreSQL database uses a non-UTF-8 encoding (such as SQL_ASCII or LATIN1), character data may not be captured correctly. Convert your database to UTF-8 encoding before configuring CDC.
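The failure mode is easy to reproduce outside the database. In this Python sketch, a LATIN1-encoded value (the kind a SQL_ASCII database will happily store) is not valid UTF-8, so a strict decode fails and a lenient decode loses data:

```python
# The byte 0xE9 is "é" in LATIN1 but an invalid sequence start in UTF-8.
stored = "résumé".encode("latin-1")  # b'r\xe9sum\xe9'

try:
    stored.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8 compatible:", exc)

# A lenient decode replaces the bytes instead of failing, i.e. data loss:
print(stored.decode("utf-8", errors="replace"))  # r�sum�
```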

Common Encoding Issues

Symptoms: Text appears as question marks, mojibake, or unexpected characters at the destination.
Common causes:
  • The source database is not using UTF-8 encoding
  • A character set mismatch between the database server and client connection
  • Legacy encoding (e.g., latin1 in MySQL or SQL_ASCII in PostgreSQL) storing non-ASCII data
Resolution:
  1. Verify the source database encoding (see database-specific notes above)
  2. Convert the database or affected tables/columns to UTF-8 (utf8mb4 for MySQL, UTF8 for PostgreSQL)
  3. Ensure the database client connection is set to UTF-8
Symptoms: Some characters are silently dropped or replaced during transit.
Common causes:
  • Non-UTF-8 characters in the source data that cannot be decoded
  • MySQL utf8 charset (3-byte) used instead of utf8mb4 (4-byte), truncating 4-byte characters
  • Source data contains invalid byte sequences
Resolution:
  1. For MySQL, switch to utf8mb4: ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  2. Identify and clean invalid byte sequences in the source data
  3. Ensure the source database uses a full UTF-8 encoding
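The utf8-versus-utf8mb4 distinction comes down to bytes per character, which is easy to check in Python:

```python
emoji = "😀"  # U+1F600, outside the Basic Multilingual Plane
assert len(emoji.encode("utf-8")) == 4  # needs 4 bytes: utf8mb4 territory

cjk = "漢"  # U+6F22, inside the BMP
assert len(cjk.encode("utf-8")) == 3  # fits MySQL's legacy 3-byte utf8
```

MySQL's legacy utf8 (utf8mb3) caps characters at 3 bytes, so the emoji above would be rejected or truncated there.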
Symptoms: BLOB, BYTEA, or VARBINARY columns appear as long base64-encoded strings at the destination instead of raw binary.
Cause: This is expected behavior. Binary columns must be serialized for transport, and with text-based serialization formats (such as JSON) raw bytes are represented as base64. The Represent binary data as setting on the source connector controls the format.
Resolution:
  • If your destination needs raw binary, check whether it supports automatic base64 decoding
  • Review the Represent binary data as setting on your source connector (default: bytes)
  • For text-based destinations (JSON, CSV), base64 encoding is the standard approach for binary data
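If your destination does need the raw bytes, decoding the base64 string once on arrival recovers them exactly; the value below is illustrative:

```python
import base64

arrived = "3q2+7w=="  # binary column value as received at a text destination

raw = base64.b64decode(arrived)
assert raw == b"\xde\xad\xbe\xef"
```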
Symptoms: Emoji, Chinese/Japanese/Korean characters, or other multi-byte Unicode characters are missing or corrupted.
Common causes:
  • MySQL using utf8 instead of utf8mb4 (only supports Basic Multilingual Plane, up to 3 bytes)
  • Destination column defined with insufficient character set support
  • Intermediate systems stripping 4-byte UTF-8 sequences
Resolution:
  1. For MySQL sources, ensure the character set is utf8mb4 at the server, database, and table levels
  2. Verify the destination table/column supports full UTF-8 (4-byte)
  3. Test with a known multi-byte string (e.g., an emoji) before going to production
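For step 3, a small pre-production check can flag values that require full 4-byte UTF-8 support; the helper name here is invented:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character falls outside the BMP (needs 4 UTF-8 bytes)."""
    return any(ord(ch) > 0xFFFF for ch in text)

assert needs_utf8mb4("launch 🚀")
assert not needs_utf8mb4("café 東京")  # accented and CJK text fit in 3 bytes
```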

Best Practices

  1. Use UTF-8 encoding at the source database level — this is the single most important step for preventing encoding issues
  2. For MySQL, always use utf8mb4 — MySQL’s utf8 only supports 3-byte characters and silently truncates 4-byte characters (emoji, some CJK characters, mathematical symbols)
  3. Test with special characters before production — insert rows containing emoji, accented characters, and CJK text, then verify they arrive correctly at the destination
  4. For binary data, choose the right representation — review the Represent binary data as setting on your source connector and ensure your destination can handle the chosen format
  5. Monitor the Dead Letter Queue (DLQ) for encoding errors — encoding-related failures appear in the Dead Letter Queue; check error headers for conversion or charset messages

Destination Encoding Configuration

Most destinations receive data as UTF-8 by default and do not require additional encoding configuration.

Redis

The Redis (Generic) destination includes a configurable Character Encoding parameter (default: UTF-8). This setting controls the character set used for encoding string values written to Redis. Adjust this only if your Redis consumers expect a different encoding.

Other Destinations

All other destinations use UTF-8 encoding by default. No additional encoding configuration is required.

Known Limitations

  • Non-UTF-8 error handling: The exact behavior when non-UTF-8 data is encountered (silent replacement, error, or routing to the DLQ) depends on the source connector and database. Consult your specific source connector documentation.
  • Encoding conversion as a transform: Encoding conversion is not currently available as a built-in transform option. Data must be UTF-8 compatible at the source.
  • Maximum string length: String length limits vary by destination. Check your destination connector documentation for column size constraints and the DLQ for size-related errors.