MOLT Replicator Troubleshooting


Forward replication issues

Performance troubleshooting

If MOLT Replicator appears hung or performs poorly:

  1. Enable trace logging with -vv to get more visibility into the replicator's state and behavior.

  2. If MOLT Replicator is in an unknown, hung, or erroneous state, collect performance profiles to include with support tickets. Replace {host} and {metrics-port} with your Replicator host and the port specified by --metricsAddr:

    curl '{host}:{metrics-port}/debug/pprof/trace?seconds=15' > trace.out
    curl '{host}:{metrics-port}/debug/pprof/profile?seconds=15' > profile.out
    curl '{host}:{metrics-port}/debug/pprof/goroutine?seconds=15' > gr.out
    curl '{host}:{metrics-port}/debug/pprof/heap?seconds=15' > heap.out
    
  3. Monitor lag metrics and adjust performance parameters as needed.

Unable to create publication or slot

This error occurs when the source database does not support logical replication, or when logical replication is not enabled on it.

Resolution: If you are connected to a replica, connect to the primary instance instead. Replicas cannot create or manage logical replication slots or publications.
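
To confirm whether you are connected to a replica, you can run the following standard PostgreSQL check, which returns true on a standby and false on the primary:

SELECT pg_is_in_recovery();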

Verify that the source database supports logical replication by checking the wal_level parameter on PostgreSQL:

SHOW wal_level;

If wal_level is not set to logical, update it and restart PostgreSQL:

ALTER SYSTEM SET wal_level = 'logical';
Replication slot already exists
ERROR: replication slot "molt_slot" already exists

Resolution: Either create a new slot with a different name, or drop the existing slot to start fresh:

SELECT pg_drop_replication_slot('molt_slot');
Warning:

Dropping a replication slot can be destructive and delete data that is not yet replicated. Only use this if you want to restart replication from the current position.

Publication does not exist
run CREATE PUBLICATION molt_fetch FOR ALL TABLES;

Resolution: Create the publication on the source database. Ensure you also create the replication slot:

CREATE PUBLICATION molt_publication FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('molt_slot', 'pgoutput');
Could not connect to PostgreSQL
could not connect to source database: failed to connect to `user=migration_user database=migration_db`

Resolution: Verify the connection details including user, host, port, and database name. Ensure the database name in your --sourceConn connection string matches exactly where you created the publication and slot. Verify you're connecting to the same host and port where you ran the CREATE PUBLICATION and SELECT pg_create_logical_replication_slot() commands. Check if TLS certificates need to be included in the connection URI.
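
As a quick sanity check, you can confirm which database you are connected to and whether the expected publication exists there. The publication name below is the example used elsewhere on this page; substitute your own:

SELECT current_database();
SELECT pubname FROM pg_publication WHERE pubname = 'molt_publication';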

Wrong replication slot name
run SELECT pg_create_logical_replication_slot('molt_slot', 'pgoutput'); in source database

Resolution: Create the replication slot or verify the correct slot name:

SELECT pg_create_logical_replication_slot('molt_slot', 'pgoutput');
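
To check which slots already exist and confirm the exact slot name and database, you can query the standard pg_replication_slots view:

SELECT slot_name, plugin, database, active FROM pg_replication_slots;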
Repeated binlog syncing restarts

If Replicator repeatedly restarts binlog syncing or starts replication from an unexpectedly old location, this indicates an invalid or purged GTID. When an invalid GTID is provided, the binlog syncer will fall back to the first valid GTID.

Resolution: Verify the GTID set is valid and not purged:

-- Check if GTID is in executed set
SELECT GTID_SUBSET('your-gtid-set', @@GLOBAL.gtid_executed) AS in_executed;

-- Check if GTID is purged
SELECT GTID_SUBSET('your-gtid-set', @@GLOBAL.gtid_purged) AS in_purged;

Interpret the results as follows:

  • If in_executed returns 1 and in_purged returns 0, the GTID is valid for replication.
  • If in_purged returns 1, the GTID has been purged and you must find a newer consistent point.
  • If both return 0, the GTID does not exist in the server's GTID history and is invalid.

If the GTID is purged or invalid, follow these steps:

  1. Increase binlog retention by configuring binlog_expire_logs_seconds in MySQL:

    -- Increase binlog retention (example: 7 days = 604800 seconds)
    SET GLOBAL binlog_expire_logs_seconds = 604800;
    
    Note:

    For managed MySQL services (such as Amazon RDS, Google Cloud SQL, or Azure Database for MySQL), binlog retention is typically configured through the provider's console or CLI. Consult your provider's documentation for how to adjust binlog retention settings.

  2. Get a current GTID set to restart replication:

    -- For MySQL < 8.0:
    SHOW MASTER STATUS;
    -- For MySQL 8.0+:
    SHOW BINARY LOG STATUS;
    
    +---------------+----------+--------------+------------------+-------------------------------------------+
    | File          | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                         |
    +---------------+----------+--------------+------------------+-------------------------------------------+
    | binlog.000005 |      197 |              |                  | 77263736-7899-11f0-81a5-0242ac120002:1-38 |
    +---------------+----------+--------------+------------------+-------------------------------------------+
    

    Use the Executed_Gtid_Set value for the --defaultGTIDSet flag.

Invalid GTID format

Invalid GTIDs can occur when GTIDs are purged due to insufficient binlog retention, when connecting to a replica instead of the primary host, or when passing a GTID that has valid format but doesn't exist in the binlog history.

Resolution: Use a valid GTID from SHOW MASTER STATUS (MySQL < 8.0) or SHOW BINARY LOG STATUS (MySQL 8.0+) and ensure you're connecting to the primary host. If GTIDs are being purged, increase binlog retention.
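
As a quick check, you can inspect the GTID state and read-only status of the host you are connected to. A replica is typically, though not always, configured with read_only enabled, so this is only a hint:

-- GTID sets on the connected host
SELECT @@GLOBAL.gtid_executed AS executed, @@GLOBAL.gtid_purged AS purged;

-- A primary host normally reports read_only = 0
SELECT @@GLOBAL.read_only;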

Table/column names exceed 30 characters

Oracle LogMiner excludes tables and columns with names longer than 30 characters from redo logs.

Resolution: Rename tables and columns to 30 characters or fewer before migration.
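
One way to find the affected objects before migration is to query the Oracle data dictionary for names longer than 30 characters (narrow the results to your schema as needed):

SELECT owner, table_name
FROM all_tables
WHERE LENGTH(table_name) > 30;

SELECT owner, table_name, column_name
FROM all_tab_columns
WHERE LENGTH(column_name) > 30;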

Unsupported data types

LogMiner and replication do not support:

  • Long BLOB/CLOBs (4000+ characters)
  • User-defined types (UDTs)
  • Nested tables
  • Varrays
  • GEOGRAPHY and GEOMETRY

Resolution: Convert unsupported data types or exclude affected tables from replication.
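
To locate candidate columns, you can scan the Oracle data dictionary. This is only a rough filter (for example, LOB columns under the length limit may still replicate), so review the results against the list above:

SELECT owner, table_name, column_name, data_type
FROM all_tab_columns
WHERE data_type IN ('CLOB', 'NCLOB', 'BLOB', 'LONG', 'SDO_GEOMETRY')
   OR data_type_owner IS NOT NULL  -- user-defined types, nested tables, varrays
ORDER BY owner, table_name;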

LOB column UPDATE statements

UPDATE statements that only modify LOB columns are not supported by Oracle LogMiner.

Resolution: Avoid LOB-only updates during replication, or use Binary Reader for Oracle 12c.

JSONB null handling

SQL NULL and JSON null values are not distinguishable in JSON payloads during replication.

Resolution: Avoid using nullable JSONB columns where the distinction between SQL NULL and JSON null is important.
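
The two values are distinct inside the database but collapse to the same representation in the replicated JSON payload. A minimal illustration on a PostgreSQL-compatible database:

-- SQL NULL: the column holds no value at all
SELECT (NULL::JSONB) IS NULL;    -- true

-- JSON null: the column holds the JSON value null
SELECT ('null'::JSONB) IS NULL;  -- false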

Missing redo logs or unavailable SCN

If the Oracle redo log files are too small or do not retain enough history, you may get errors indicating that required log files are missing for a given SCN range, or that a specific SCN is unavailable.

Resolution: Increase the number and size of online redo log files, and verify that archived log files are being generated and retained correctly in your Oracle environment.
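
To review the current redo configuration, you can query the standard Oracle views for online redo log groups and recent archived logs (the column selection here is illustrative):

-- Online redo log groups and sizes
SELECT group#, thread#, bytes/1024/1024 AS size_mb, status FROM v$log;

-- Recent archived logs, to confirm archiving is active and retained
SELECT sequence#, first_time, next_time, deleted
FROM v$archived_log
ORDER BY first_time DESC
FETCH FIRST 10 ROWS ONLY;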

Replicator lag

If the replicator process is lagging significantly behind the current Oracle SCN, you may see log messages like: replicator is catching up to the current SCN at 5000 from 1000…. This indicates that replication is progressing but is still behind the most recent changes on the source database.
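
To gauge how far behind the replicator is, you can compare the SCN reported in the log message against the database's current SCN:

SELECT current_scn FROM v$database;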

Schema drift errors

The following warning indicates that the source and target schemas are mismatched:

WARNING: schema drift detected in "database"."table" at payload object offset 0: unexpected columns: column_name

Resolution: Align schemas or use userscripts to transform data.
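
To see what the target actually contains for a drifting table, you can list its columns from information_schema and compare them with the source definition. The schema and table names below are the examples used elsewhere on this page; substitute your own:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'migration_schema' AND table_name = 'employees'
ORDER BY ordinal_position;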

Apply flow failures

Apply flow failures occur when the target database encounters errors during apply operations, such as unique constraint violations, an unavailable target database, or incorrect data (missing or extraneous columns):

WARNING: warning during tryCommit: ERROR: duplicate key value violates unique constraint
ERROR: maximum number of retries (10) exceeded

Resolution: Check target database constraints and connection stability. MOLT Replicator will log warnings for each retry attempt. If you see warnings but no final error, the apply succeeded after retrying. If all retry attempts are exhausted, Replicator will surface a final error and restart the apply loop to continue processing.
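
When the failure is a constraint violation, it can help to review the constraints defined on the affected target table. For example, on a CockroachDB target, using the example employees table from this page:

SHOW CONSTRAINTS FROM employees;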

Failback issues

If the changefeed shows connection errors in SHOW CHANGEFEED JOB:

Connection refused
transient error: Post "https://replicator-host:30004/migration_db/migration_schema": dial tcp [::1]:30004: connect: connection refused

This indicates that Replicator is down, the webhook URL is incorrect, or the port is misconfigured.

Resolution: Verify that MOLT Replicator is running on the port specified in the changefeed INTO configuration. Confirm the host and port are correct.

Incorrect schema path errors

This error occurs when the CockroachDB changefeed webhook URL path does not match the target database schema naming convention:

transient error: 400 Bad Request: unknown schema:

The webhook URL path is specified in the INTO clause when you create the changefeed. For example: webhook-https://replicator-host:30004/database/schema.

Resolution: Verify the webhook path format matches your target database type:

  • PostgreSQL targets should use /database/schema format. For example, webhook-https://replicator-host:30004/migration_db/migration_schema.
  • MySQL targets should use /database format. For example, webhook-https://replicator-host:30004/migration_db.
  • Oracle targets should use /SCHEMA format in uppercase. For example, webhook-https://replicator-host:30004/MIGRATION_SCHEMA.

For details on configuring the webhook sink URI, refer to Webhook sink.

GC threshold error
batch timestamp * must be after replica GC threshold

This indicates that replication is starting from an invalid cursor whose data has already been garbage collected.

Resolution: Double-check the cursor to ensure it represents a valid range that has not been garbage collected, or extend the GC TTL on the source CockroachDB cluster:

ALTER DATABASE defaultdb CONFIGURE ZONE USING gc.ttlseconds = {gc_ttl_in_seconds};
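
To confirm the zone configuration in effect on the source cluster, before or after extending the TTL, you can inspect it as follows:

SHOW ZONE CONFIGURATION FROM DATABASE defaultdb;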
Duplicated data re-application

This occurs when resuming a changefeed from a cursor causes excessive data duplication.

Resolution: Clear the staging database to prevent duplication. This deletes all checkpoints and buffered data, so use with caution:

DROP DATABASE _replicator;

For more targeted cleanup, delete mutations from specific staging tables:

DELETE FROM _replicator.employees WHERE true;
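
To see which staging tables exist before deleting from them, you can list the contents of the staging database:

SHOW TABLES FROM _replicator;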

See also
