
# Data Encoding & Character Sets

> How Streamkap handles character encoding, binary data, and charset configuration in CDC pipelines.

## Overview

Streamkap uses **UTF-8** as its internal encoding for all data in transit through the pipeline. Text data flowing through Streamkap is expected to be UTF-8 compatible. Binary data (such as BLOBs and raw byte columns) is handled separately: it is either preserved in its original form or encoded as base64, depending on the source configuration and destination requirements.

Key points:

* All text fields are serialized as UTF-8 in Kafka
* Binary columns (`BLOB`, `BYTEA`, `VARBINARY`, `RAW`) are configurable via the **Represent binary data as** source setting
* Non-UTF-8 text data is not supported and may result in data loss or errors

## How Encoding Works in the Pipeline

Data flows through the following stages:

```text
Source Database --> Kafka (UTF-8 serialization) --> Destination
```

| Data type                                       | How it is handled                                                                                                              |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| **Text fields** (VARCHAR, TEXT, NVARCHAR, CHAR) | Decoded and serialized as UTF-8 strings                                                                                        |
| **Binary fields** (BLOB, BYTEA, VARBINARY, RAW) | Preserved as binary; representation controlled by the **Represent binary data as** source connector setting (default: `bytes`) |
| **JSON fields**                                 | Serialized as UTF-8 JSON strings                                                                                               |
| **Decimal/Numeric fields**                      | Encoded as bytes with schema metadata or as numeric strings, depending on connector configuration                              |
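
The "bytes with schema metadata" form for decimals is easiest to understand from the consumer side. The sketch below assumes the Kafka Connect `Decimal` logical type convention, in which the payload is the unscaled integer as big-endian two's-complement bytes and the scale is carried in the schema. The function name and sample values are illustrative only; confirm the actual encoding against your source connector's setup page.

```python
from decimal import Decimal

def decode_connect_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode unscaled big-endian two's-complement bytes using the schema's scale.

    Assumes the Kafka Connect Decimal convention; not a Streamkap API.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 123.45 with scale 2 has the unscaled value 12345, which is 0x3039.
print(decode_connect_decimal(b"\x30\x39", 2))  # 123.45
# Negative values use two's complement: -15 fits in the single byte 0xF1.
print(decode_connect_decimal(b"\xf1", 1))      # -1.5
```

If the connector is configured to emit numeric strings instead, no decoding is needed and the value can be passed straight to `Decimal(...)`.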

### Binary Data Representation

All source connectors that handle binary columns expose a **Represent binary data as** setting. It controls how binary column data (e.g., `BLOB`, `BINARY`, `VARBINARY`) is represented in the pipeline. The default is `bytes`; choose the option that your destination can ingest.

<Info>
  The binary data representation setting is configured per source connector. See your specific source connector setup page for details.
</Info>
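
To make the options concrete, the sketch below renders the same bytes three ways using Python's standard library. The mode names `bytes`, `base64`, and `hex` mirror the values listed for MongoDB on this page; whether a given connector offers all three is an assumption to verify on its setup page, and the `represent_binary` helper is hypothetical.

```python
import base64

def represent_binary(raw: bytes, mode: str = "bytes"):
    """Return raw bytes unchanged, or a base64/hex text rendering of them.

    Hypothetical helper for illustration; not part of Streamkap.
    """
    if mode == "bytes":
        return raw
    if mode == "base64":
        return base64.b64encode(raw).decode("ascii")
    if mode == "hex":
        return raw.hex()
    raise ValueError(f"unknown mode: {mode}")

sample = b"\xde\xad\xbe\xef"
print(represent_binary(sample, "base64"))  # 3q2+7w==
print(represent_binary(sample, "hex"))     # deadbeef
```

The text renderings are larger than the raw bytes (base64 by roughly a third, hex by a factor of two), which is worth weighing for large binary columns.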

## Database-Specific Encoding Notes

<Tabs>
  <Tab title="PostgreSQL">
    * PostgreSQL supports UTF-8 natively when the database is created with `UTF8` encoding
    * Verify your database encoding: `SHOW server_encoding;`
    * The `client_encoding` should be set to `UTF8` for the Streamkap connection
    * Binary data types (`BYTEA`) are handled via the **Represent binary data as** setting
    * Arrays, JSON/JSONB, and hstore types are serialized as UTF-8 strings
    * **Unsupported**: Non-UTF-8 database encodings (e.g., `LATIN1`, `SQL_ASCII`)

    <Warning>
      If your PostgreSQL database uses a non-UTF-8 encoding (such as `SQL_ASCII` or `LATIN1`), character data may not be captured correctly. Convert your database to UTF-8 encoding before configuring CDC.
    </Warning>
  </Tab>

  <Tab title="MySQL">
    * Check your server character set: `SHOW VARIABLES LIKE 'character_set_server';`
    * Check your database character set: `SHOW VARIABLES LIKE 'character_set_database';`
    * **Recommended**: Use `utf8mb4` as the character set (supports the full Unicode range, including emoji and CJK characters)
    * MySQL's legacy `utf8` charset (an alias for `utf8mb3`) stores at most 3 bytes per character, which covers the Basic Multilingual Plane but not 4-byte characters such as emoji
    * Binary data types (`BLOB`, `BINARY`, `VARBINARY`) are handled via the **Represent binary data as** setting
    * `JSON`, `ENUM`, and `SET` types are serialized as UTF-8 strings
    * **Unsupported**: Non-UTF-8 character sets (e.g., `latin1`, `cp1252`)
  </Tab>

  <Tab title="Oracle">
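    * Check your database character set: `SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET';` (or, without access to `SYS` objects, `SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';`)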
    * **Recommended**: `AL32UTF8` for best compatibility with Streamkap
    * `NLS_NCHAR_CHARACTERSET` should be `AL16UTF16` (default) for NVARCHAR2/NCHAR columns; these are converted to UTF-8 by the connector
    * Binary types (`RAW`, `LONG RAW`, `BLOB`) are handled via the **Represent binary data as** setting
    * CLOB and NCLOB data is serialized as UTF-8 strings
    * XMLTYPE and JSON (12c+) are serialized as UTF-8 strings
  </Tab>

  <Tab title="SQL Server">
    * `nvarchar`, `nchar`, and `ntext` columns use UTF-16 internally; the Streamkap connector converts these to UTF-8
    * `varchar` and `char` columns use the database collation's code page; ensure these use a UTF-8 compatible collation (SQL Server 2019+ supports `_UTF8` collations)
    * Binary types (`BINARY`, `VARBINARY`, `IMAGE`) are handled via the **Represent binary data as** setting
    * XML and hierarchyid types are serialized as UTF-8 strings
    * **Unsupported**: Encodings other than UTF-8 and UTF-16, such as the legacy code pages used by older collations; these may cause data loss
  </Tab>

  <Tab title="MongoDB">
    * BSON (MongoDB's binary JSON format) uses UTF-8 natively for all string data
    * String fields are serialized as UTF-8 without conversion
    * Binary data types are configurable (bytes, base64, or hex)
    * Array and nested document encoding is controlled by the **Array Encoding** and **Nested Document Encoding** source settings
    * **Unsupported**: Non-UTF-8 string data in BSON documents
    * Oversized BSON documents are handled according to the selected strategy (fail, skip, or split)
  </Tab>
</Tabs>
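
The difference between MySQL's `utf8` and `utf8mb4` comes down to how many bytes a character needs in UTF-8. The following sketch, which is not part of Streamkap, flags strings containing characters outside the Basic Multilingual Plane, which therefore need `utf8mb4`. The helper name `needs_utf8mb4` is hypothetical.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

for sample in ["café", "日本語", "launch 🚀"]:
    # Width in bytes of the widest single character once encoded as UTF-8.
    widest = max(len(ch.encode("utf-8")) for ch in sample)
    print(f"{sample!r}: widest character = {widest} bytes, "
          f"needs utf8mb4 = {needs_utf8mb4(sample)}")
```

Running a check like this over a sample of source rows gives a quick sense of whether a 3-byte character set is already losing data.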

## Common Encoding Issues

<AccordionGroup>
  <Accordion title="Garbled characters at the destination">
    **Symptoms**: Text appears as question marks, mojibake, or unexpected characters at the destination.

    **Common causes**:

    * The source database is not using UTF-8 encoding
    * A character set mismatch between the database server and client connection
    * Legacy encoding (e.g., `latin1` in MySQL or `SQL_ASCII` in PostgreSQL) storing non-ASCII data

    **Resolution**:

    1. Verify the source database encoding (see database-specific notes above)
    2. Convert the database or affected tables/columns to UTF-8 (`utf8mb4` for MySQL, `UTF8` for PostgreSQL)
    3. Ensure the database client connection is set to UTF-8
  </Accordion>

  <Accordion title="Missing characters in text fields">
    **Symptoms**: Some characters are silently dropped or replaced during transit.

    **Common causes**:

    * Non-UTF-8 characters in the source data that cannot be decoded
    * MySQL `utf8` charset (3-byte) used instead of `utf8mb4` (4-byte), so 4-byte characters are rejected or truncated when written
    * Source data contains invalid byte sequences

    **Resolution**:

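    1. For MySQL, switch to `utf8mb4`. `ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;` changes only the default for newly created tables, so also convert each existing table: `ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;`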
    2. Identify and clean invalid byte sequences in the source data
    3. Ensure the source database uses a full UTF-8 encoding
  </Accordion>

  <Accordion title="Binary data appears as base64 strings">
    **Symptoms**: `BLOB`, `BYTEA`, or `VARBINARY` columns appear as long base64-encoded strings at the destination instead of raw binary.

    **Cause**: This is expected behavior. Binary columns must be serialized for transport through Kafka, and text-based formats such as JSON cannot carry raw bytes, so binary values are commonly rendered as base64. The **Represent binary data as** setting on the source connector controls the format.

    **Resolution**:

    * If your destination needs raw binary, check whether it supports automatic base64 decoding
    * Review the **Represent binary data as** setting on your source connector (default: `bytes`)
    * For text-based destinations (JSON, CSV), base64 encoding is the standard approach for binary data
  </Accordion>

  <Accordion title="Special characters (emoji, CJK) not rendering">
    **Symptoms**: Emoji, Chinese/Japanese/Korean characters, or other multi-byte Unicode characters are missing or corrupted.

    **Common causes**:

    * MySQL using `utf8` instead of `utf8mb4` (only supports Basic Multilingual Plane, up to 3 bytes)
    * Destination column defined with insufficient character set support
    * Intermediate systems stripping 4-byte UTF-8 sequences

    **Resolution**:

    1. For MySQL sources, ensure the character set is `utf8mb4` at the server, database, and table levels
    2. Verify the destination table/column supports full UTF-8 (4-byte)
    3. Test with a known multi-byte string (e.g., an emoji) before going to production
  </Accordion>
</AccordionGroup>
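
The first two issues above can be reproduced outside any database. The sketch below uses only Python's standard library to show how a charset mismatch produces mojibake and how a strict decode locates invalid byte sequences. The `find_invalid_utf8` helper is hypothetical and says nothing about how Streamkap itself reports such data.

```python
def find_invalid_utf8(raw: bytes):
    """Return the offset of the first invalid UTF-8 byte, or None if valid."""
    try:
        raw.decode("utf-8", errors="strict")
        return None
    except UnicodeDecodeError as exc:
        return exc.start

# Mojibake: UTF-8 bytes read back with the wrong charset.
utf8_bytes = "é".encode("utf-8")         # b'\xc3\xa9'
print(utf8_bytes.decode("latin-1"))      # Ã©

# Invalid sequence: latin-1 bytes handed to a UTF-8 decoder.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
print(find_invalid_utf8(latin1_bytes))   # 3

# A lenient decoder substitutes U+FFFD, which is how characters go "missing".
print(latin1_bytes.decode("utf-8", errors="replace"))
```

Exporting a suspect column as raw bytes and running it through a strict decode like this is one way to find the rows that need cleaning before conversion.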

## Best Practices

1. **Use UTF-8 encoding at the source database level** -- this is the single most important step for preventing encoding issues
2. **For MySQL, always use `utf8mb4`** -- MySQL's `utf8` stores at most 3 bytes per character, so 4-byte characters (emoji, some CJK characters, mathematical symbols) are rejected in strict SQL mode or truncated with only a warning otherwise
3. **Test with special characters before production** -- insert rows containing emoji, accented characters, and CJK text, then verify they arrive correctly at the destination
4. **For binary data, choose the right representation** -- review the **Represent binary data as** setting on your source connector and ensure your destination can handle the chosen format
5. **Monitor the Dead Letter Queue (DLQ) for encoding errors** -- encoding-related failures appear in the [Dead Letter Queue](/dlq-operations); check error headers for conversion or charset messages
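
For the pre-production test in step 3, a small comparison script helps spot corruption that is hard to see by eye. This is a minimal sketch: the test strings are examples, `compare_round_trip` is a hypothetical helper, and fetching the values from your source and destination is left to whatever client you already use.

```python
TEST_STRINGS = {
    "accented": "Crème brûlée",
    "cjk": "日本語 中文 한국어",
    "emoji": "ship it 🚀",
}

def compare_round_trip(expected: dict, actual: dict) -> list:
    """Return a description of each test string that did not arrive intact."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems

# Simulated destination in which the 4-byte character was replaced.
arrived = dict(TEST_STRINGS, emoji="ship it ?")
for line in compare_round_trip(TEST_STRINGS, arrived):
    print(line)
```

An empty result means every test string survived the pipeline unchanged; any entry points at the category of character that needs attention.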

## Destination Encoding Configuration

Most destinations receive data as UTF-8 by default and do not require additional encoding configuration.

### Redis

The [Redis (Generic)](/redis-destination-generic) destination includes a configurable **Character Encoding** parameter (default: `UTF-8`). This setting controls the character set used for encoding string values written to Redis. Adjust this only if your Redis consumers expect a different encoding.

### Other Destinations

All other destinations use UTF-8 encoding by default. No additional encoding configuration is required.

## Known Limitations

* **Non-UTF-8 error handling**: The exact behavior when non-UTF-8 data is encountered (silent replacement, error, or routing to the DLQ) depends on the source connector and database. Consult your specific source connector documentation.
* **Encoding conversion as a transform**: Encoding conversion is not currently available as a built-in transform option. Data must be UTF-8 compatible at the source.
* **Maximum string length**: String length limits vary by destination. Check your destination connector documentation for column size constraints and the [DLQ](/dlq-operations) for size-related errors.
