clickhouse-sink-connector/sink-connector/python/README.md
2023-04-04 13:29:46 -04:00

86 lines
4.4 KiB
Markdown

## Installation
source ./install.sh
## Loading data to ClickHouse from a dump
clickhouse_loader.py is a program that loads data dumped in MySQL into a CH database compatible the sink connector (ReplacingMergeTree with virtual columns _version and _sign)
It creates the schema (--schema_only option if you only need the schema), and loads the data using clickhouse-client and zstd to decompress the dump files.
The schema conversion is very crude. It does not use a SQL parser but it just applies regexp filters
Please submit enhancement requires or bug fixes.
Limitations :
- only fully tested with mysqlsh dumps (sakila, employees, airportdb, world, menagerie)
- ClickHouse Date / DateTime range conversion can lead to data loss if outside the supported ranges
- it does not support partition creation
- it does not support dump and load from a S3 compatible bucket
- make sure you run checksums to validate the load
Example : assuming the world database was dumped with mysqlsh using this command ($HOME needs to be replaced with a constant) :
MySQL Shell(JS version works better with Python)
```
mysqlsh -uroot -proot -hlocalhost -e "util.dumpSchemas(['world'], '$HOME/dbdumps/world');"
## Dump separate tables
## If there are no permissions for LOCKING, use consistent: false flag.
util.dumpTables("db_name", ["table1", "table2"], "output_directory", {"consistent": "false", "threads": "100"})
```
!mydumper is not supported at this stage!
```
DATABASE=world
## Error with max number of concurrent connections(100)
docker exec -it clickhouse clickhouse-client -uroot --password root --query "drop database if exists $DATABASE"
python db_load/clickhouse_loader.py --clickhouse_host localhost --clickhouse_database $DATABASE --dump_dir $HOME/dbdumps/$DATABASE --clickhouse_user root --clickhouse_password root --threads 50 --mysql_source_database $DATABASE --mysqlshell
```
If you loaded the same data in MySQL, you can then run checksums (see test_db.sh)
## Table checksums
Credits : https://www.sisense.com/blog/hashing-tables-to-ensure-consistency-in-postgres-redshift-and-mysql/
Compute the checksum of one table
```
python db_compare/clickhouse_table_checksum.py --clickhouse_host localhost --clickhouse_user root --clickhouse_password root --clickhouse_database menagerie --tables_regex ^pet --threads 4
2022-09-11 19:39:45,455 - INFO - ThreadPoolExecutor-0_0 - Checksum for table menagerie.pet = 3d19b8b13cf29b5192068278123c5059 count 9
python db_compare/mysql_table_checksum.py --mysql_host localhost --mysql_user root --mysql_password root --mysql_database menagerie --tables_regex ^pet --threads 4
2022-09-11 19:39:49,148 - INFO - ThreadPoolExecutor-0_0 - Checksum for table menagerie.pet = 3d19b8b13cf29b5192068278123c5059 count 9
```
## Connecting to secure ClickHouse
```
python3 clickhouse_table_checksum.py --sign_column=sign --secure=True --clickhouse_port 9440 --clickhouse_host secure-host --clickhouse_user user --clickhouse_password password --clickhouse_database das --tables_regex '^products' --exclude_columns=[sign,ver]
```
## Exclude columns
```
python3 clickhouse_table_checksum.py --sign_column=sign --secure=True --clickhouse_port 9440 --clickhouse_host secure-host --clickhouse_user user --clickhouse_password password --clickhouse_database das --tables_regex '^products' --exclude_columns=sign,ver --debug_output --debug
```
Compare all tables in database and diff the checksums
```
python db_compare/clickhouse_table_checksum.py --clickhouse_host localhost --clickhouse_user root --clickhouse_password root --clickhouse_database menagerie --tables_regex . --threads 4 | grep "Checksum for table" | awk '{print $11" "$13" "$15}' | sort >menagerie.ch
python db_compare/mysql_table_checksum.py --mysql_host localhost --mysql_user root --mysql_password root --mysql_database menagerie --tables_regex . --threads 4 | grep "Checksum for table" | awk '{print $11" "$13" "$15}' | sort >menagerie.mysql
diff menagerie.ch menagerie.mysql | grep "<\|>"
```
Debug differences
Example table pet :
```
python db_compare/mysql_table_checksum.py --mysql_host localhost --mysql_user root --mysql_password root --mysql_database menagerie --tables_regex "^pet" --debug_output
python db_compare/clickhouse_table_checksum.py --clickhouse_host localhost --clickhouse_user root --clickhouse_password root --clickhouse_database menagerie --tables_regex "^pet" --debug_output
diff out.pet.ch.txt out.pet.mysql.txt | grep "<\|>"
```