TrinoDBT不同类型数据库数据清洗统计
创始人
2024-06-03 08:38:53
0

Trino&DBT不同类型数据库数据清洗统计

流程图

在这里插入图片描述

环境搭建

采用docker安装,还是一如既往使用我最喜欢的docker-compose;

Clickhouse和MySQL的安装中规中矩,也不考虑什么架构,直接拉取镜像打包即可

主要是:

1、dbt需要提前制作好镜像,参考:https://blog.csdn.net/huxiang19851114/article/details/126626213?spm=1001.2014.3001.5502

2、trino需要提前配置好环境变量,参考:https://blog.csdn.net/huxiang19851114/article/details/125926092?spm=1001.2014.3001.5502

PS:如果不想自己动手,可以直接下载我上传的文件目录:https://download.csdn.net/download/huxiang19851114/87570325,下载完后解压,除了dbt需要执行docker build生成镜像外,其他都不要改了:

docker build --tag dbt:v1.0.0 .

Docker Compose

version: '3.6'
services:clickhouse:image: yandex/clickhouse-server:20.8.7.15container_name: clickhouseports:- "8123:8123"- "9000:9000"mysql:container_name: mysqlimage: mysql:5.6ports:- 3306:3306environment:- MYSQL_ROOT_PASSWORD=root- MYSQL_USER=huxiang- MYSQL_PASSWORD=huxiangvolumes:- /usr/local/dbt-trino-mysql-clickhouse/mysql/db:/var/lib/mysql- /usr/local/dbt-trino-mysql-clickhouse/mysql/conf:/etc/mysql/mysql.conf.ddbt:image: dbt:v1.0.0container_name: dbtvolumes:- /usr/local/dbt-trino-mysql-clickhouse/dbt/app:/usr/app- /usr/local/dbt-trino-mysql-clickhouse/dbt/dbt-profiles:/root/.dbttty: truestdin_open: truetrino:image: trinodb/trino:359container_name: trinovolumes:- /usr/local/dbt-trino-mysql-clickhouse/trino/etc/:/etc/trino/ports:- 10080:10080tty: truestdin_open: true

Trino配置Connector

这里我们需要配置两个Connector,分别指向源数据库的读(MySQL),目标数据库的写(ClickHouse)

#/usr/local/dbt-trino-mysql-clickhouse/trino/etc/catalog/source_dw_mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://192.168.137.128:3306?enabledTLSProtocols=TLSv1.2&useSSL=false
connection-user=root
connection-password=root#/usr/local/dbt-trino-mysql-clickhouse/trino/etc/catalog/target_dw_clickhouse.properties
connector.name=clickhouse
connection-url=jdbc:clickhouse://192.168.137.128:8123
connection-user=default
connection-password=

重启Trino容器后能看到如下数据源connector,说明配置连接成功:

在这里插入图片描述

DBT配置

DBT对应的数据连接为trino,只不过各自指向source_dw_mysql和target_dw_clickhouse

配置工程加载文件

 #vim /usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/dbt_project.yml#如果该文件已自动,则增加如下配置即可,作为源库:全局数据库服务和库名变量:vars:matomo_catalog: source_dw_mysqlmatomo_schema: matomo
----------------------------------------------------------------------------# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'dbt_project'
version: '1.0.0'
config-version: 2# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_project'# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models/example"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`- "target"- "dbt_packages"vars:matomo_catalog: source_dw_mysqlmatomo_schema: matomo# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:# Config indicated by + and applies to all files under models/example/example:+materialized: table

配置Target数据源

# vim /usr/local/dbt-trino-mysql-clickhouse/dbt/dbt-profiles/profiles.yml
dbt_project:target: devoutputs:dev:type: trinohost: 192.168.137.128user: rootport: 10080database: target_dw_clickhouseschema: tjthreads: 4session_properties:query_max_run_time: 5dexchange_compression: True

配置转换脚本

将已编写的sql脚本放到**/usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/models/example**目录下,我们这里写了一个这样的脚本:

/*matomo_log_visit_material.sql 仅仅只是将源数据库中的部分字段摘除出来,增加了一些基础过滤*/
{{ config(materialized='incremental') }}with matomo_visit_material as (selectidvisit,idsite,user_id,DATE_FORMAT(visit_first_action_time,'%Y-%m-%d') as visit_first_action_time,visit_total_time,visit_goal_buyer,referer_type,location_ipfrom {{ var("matomo_catalog") }}.{{ var("matomo_schema") }}.matomo_log_visit mlv{% if is_incremental() %}wheremlv.idvisit  > (select max(idvisit) from {{ this }}){% endif %})select * from matomo_visit_material

配置加载models

我们这里暂时就一个model,所以也不存在执行顺序问题,如果有多个转换脚本,存在上下文顺序,注意dbt的加载顺序为从上到下:

#vim /usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/models/example/schema.yml
version: 2
models:- name: matomo_log_visit_materialdescription: "A starter dbt model"columns:- name: idvisitdescription: "The primary key for this table"tests:- unique- not_null

执行数据清洗

启动各服务容器,执行dbt run测试查看执行结果:

在这里插入图片描述

切换到clickhouse数据库,可以看到数据表结构已生成,数据已打通

在这里插入图片描述

多数据源统计

流程图

在这里插入图片描述

Trino配置Connector

我们这里需要配置三个Connector,分别是:

source_dw_matomo_mysql.properties

source_dw_ngbilling_mysql.properties

target_dw_clickhouse.properties

我这里只起了一个服务,所以看上去source_dw_matomo_mysql.properties和source_dw_ngbilling_mysql.properties地址是一样的,没关系,我们到时候在DBT中分别指定不同的source db name即可!实际开发中分别按自己的编写即可!

#/usr/local/dbt-trino-mysql-clickhouse/trino/etc/catalog/source_dw_matomo_mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://192.168.137.128:3306?enabledTLSProtocols=TLSv1.2&useSSL=false
connection-user=root
connection-password=root#/usr/local/dbt-trino-mysql-clickhouse/trino/etc/catalog/source_dw_matomo_mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://192.168.137.128:3306?enabledTLSProtocols=TLSv1.2&useSSL=false
connection-user=root
connection-password=root#/usr/local/dbt-trino-mysql-clickhouse/trino/etc/catalog/target_dw_clickhouse.properties
connector.name=clickhouse
connection-url=jdbc:clickhouse://192.168.137.128:8123
connection-user=default
connection-password=

DBT配置

工程加载文件

 #vim /usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/dbt_project.yml#如果该文件已自动,则增加如下配置即可,作为源库:全局数据库服务和库名变量:vars:matomo_catalog: source_dw_matomo_mysqlmatomo_schema: matomongbilling_catalog: source_dw_ngbilling_mysqlngbilling_schema: ngbilling
----------------------------------------------------------------------------# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'dbt_project'
version: '1.0.0'
config-version: 2# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_project'# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models/example"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`- "target"- "dbt_packages"vars:matomo_catalog: source_dw_matomo_mysqlmatomo_schema: matomongbilling_catalog: source_dw_ngbilling_mysqlngbilling_schema: ngbilling# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:# Config indicated by + and applies to all files under models/example/example:+materialized: table

配置Target数据源

不变,与上相同

配置统计脚本

根据需求,我们需要统计matomo.matomo_log_visit表中每日活跃用户数与ngbilling.b_daily_job中每日活跃用户数比例,编写如下统计脚本并将脚本放到**/usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/models/example**目录下:

/*matomo_ngbilling_user_login.sql*/{{ config(materialized='table') }}with matomo_ngbilling_user_login as ( SELECT DATE_FORMAT(NOW(),'%Y-%m-%d') as create_at,SUM(IF(c.source_from = 'matomo',c.cnt,0)) as matomoCnt,SUM(IF(c.source_from = 'ngbilling',c.cnt,0)) as ngbillingCntFROM(SELECT 'matomo' source_from, COUNT(DISTINCT a.user_id) cnt FROM {{ var("matomo_catalog") }}.{{ var("matomo_schema") }}.matomo_log_visit a WHERE DATE_FORMAT(a.visit_first_action_time,'%Y-%m-%d') = DATE_FORMAT(NOW(),'%Y-%m-%d')UNION ALLSELECT 'ngbilling' source_from, COUNT(DISTINCT b.user_id) cnt FROM {{ var("ngbilling_catalog") }}.{{ var("ngbilling_schema") }}.b_daily_job b WHERE DATE_FORMAT(b.submit_time,'%Y-%m-%d') = DATE_FORMAT(NOW(),'%Y-%m-%d'))c)select * from matomo_ngbilling_user_login

配置加载models

#vim /usr/local/dbt-trino-mysql-clickhouse/dbt/app/dbt_project/models/example/schema.yml
version: 2
models:- name: matomo_log_visit_materialdescription: "A starter dbt model"columns:- name: idvisitdescription: "The primary key for this table"tests:- unique- not_null- name: matomo_ngbilling_user_logindescription: "A starter dbt model"columns:- name: create_atdescription: "The primary key for this table"tests:- unique- not_null

执行统计


在这里插入图片描述

相关内容

热门资讯

喜欢穿一身黑的男生性格(喜欢穿... 今天百科达人给各位分享喜欢穿一身黑的男生性格的知识,其中也会对喜欢穿一身黑衣服的男人人好相处吗进行解...
发春是什么意思(思春和发春是什... 本篇文章极速百科给大家谈谈发春是什么意思,以及思春和发春是什么意思对应的知识点,希望对各位有所帮助,...
网络用语zl是什么意思(zl是... 今天给各位分享网络用语zl是什么意思的知识,其中也会对zl是啥意思是什么网络用语进行解释,如果能碰巧...
为什么酷狗音乐自己唱的歌不能下... 本篇文章极速百科小编给大家谈谈为什么酷狗音乐自己唱的歌不能下载到本地?,以及为什么酷狗下载的歌曲不是...
华为下载未安装的文件去哪找(华... 今天百科达人给各位分享华为下载未安装的文件去哪找的知识,其中也会对华为下载未安装的文件去哪找到进行解...
怎么往应用助手里添加应用(应用... 今天百科达人给各位分享怎么往应用助手里添加应用的知识,其中也会对应用助手怎么添加微信进行解释,如果能...
家里可以做假山养金鱼吗(假山能... 今天百科达人给各位分享家里可以做假山养金鱼吗的知识,其中也会对假山能放鱼缸里吗进行解释,如果能碰巧解...
一帆风顺二龙腾飞三阳开泰祝福语... 本篇文章极速百科给大家谈谈一帆风顺二龙腾飞三阳开泰祝福语,以及一帆风顺二龙腾飞三阳开泰祝福语结婚对应...
美团联名卡审核成功待激活(美团... 今天百科达人给各位分享美团联名卡审核成功待激活的知识,其中也会对美团联名卡审核未通过进行解释,如果能...
四分五裂是什么生肖什么动物(四... 本篇文章极速百科小编给大家谈谈四分五裂是什么生肖什么动物,以及四分五裂打一生肖是什么对应的知识点,希...