Real-Time Synchronization of MongoDB Data to Elasticsearch with Monstache

2022-08-21 18:56:14

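Background

• Use Monstache to quickly synchronize and subscribe to full or incremental data.

• Synchronize MongoDB data to a recent version of Elasticsearch in real time.

• Walk through the commonly used Monstache configuration parameters so that they can be applied to more business scenarios.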

Environment

MongoDB: 5.0.11

Elasticsearch: 7.10.1

Monstache: rel6

I. Set up the Go environment

Monstache is built with the Go toolchain, so Go must be installed before Monstache.

1. Download the Go release archive

wget https://go.dev/dl/go1.17.5.linux-amd64.tar.gz

2. Extract the archive

tar -zxvf go1.17.5.linux-amd64.tar.gz

3. Add Go to the PATH (this assumes the archive was extracted under /softpackage)

echo 'export PATH=$PATH:/softpackage/go/bin' >> /etc/profile
source /etc/profile
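
If the setup is run more than once, the PATH entry should end up in the profile file only once. The following is a minimal sketch of an idempotent way to do that. The helper name persist_path_entry is hypothetical, and both the Go directory and the profile file are passed in as arguments rather than hard-coded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append an "export PATH" line to a profile file,
# but only if that exact line is not already present.
persist_path_entry() {
  local dir="$1" profile="$2"
  # \$PATH stays literal so it is expanded when the profile is sourced.
  local line="export PATH=\$PATH:${dir}"
  if ! grep -qxF -- "$line" "$profile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$profile"
  fi
}

# Example usage (needs write access to the profile file):
#   persist_path_entry /softpackage/go/bin /etc/profile
#   source /etc/profile
```

Calling the helper a second time with the same arguments leaves the file unchanged, which keeps PATH from growing on every run.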

4. Verify the Go version

go version

II. Install Monstache

Before installing, choose the Monstache version that matches your MongoDB and Elasticsearch versions.

1. Clone the project from the Git repository

git clone https://github.com/rwynn/monstache.git

If the shell reports -bash: git: command not found, install Git first:

yum install git

2. Enter the Monstache directory

cd monstache/

3. Switch to the Monstache branch that matches the component versions listed earlier

git checkout rel6

The Elasticsearch cluster here runs version 7.10, so the rel6 branch of Monstache is used.

4. Build and install Monstache

go install

5. Check the Monstache version from the installation directory

./bin/monstache -v

If the installation succeeded, the command prints the Monstache version.

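III. Configure the real-time synchronization task

Monstache is configured with a TOML file that you create manually in the installation directory. By default, Monstache connects to Elasticsearch and MongoDB on the local host using their default ports and tails the MongoDB oplog. While Monstache is running, every change in MongoDB is synchronized to Elasticsearch.

This article uses self-managed MongoDB and Elasticsearch instances and synchronizes one specific object (the user_info collection in the testdb database), so the default Monstache configuration has to be changed as follows:

1. Go to the Monstache installation directory, then create and edit the configuration file.

cd /root/go/monstache
vim config.toml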

2. Modify the configuration file based on the following example. This is a simple configuration; for the full set of options, see Monstache Usage.

#connection settings
# connect to MongoDB using the following URL
mongo-url = "mongodb://root:<your_mongodb_password>@IP:27017"
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://IP:9200"]
# frequently required settings
# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["testdb.user_info"]
# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name.  For a deployment use an empty string.
#change-stream-namespaces = ["mydb.col"]
# additional settings
# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
#namespace-regex = '^mydb\.col$'
# compress requests to Elasticsearch
#gzip = true
# generate indexing statistics
#stats = true
# index statistics into Elasticsearch
#index-stats = true
# use the following PEM file for connections to MongoDB
#mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
#mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = "elastic"
# use the following password for Elasticsearch basic auth
elasticsearch-password = "<your_es_password>"
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
#elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
#elastic-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
resume = true
# do not validate that progress timestamps have been saved
#resume-write-unsafe = false
# override the name under which resume state is saved
#resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
resume-strategy = 0
# exclude documents whose namespace matches the following pattern
#namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
#index-files = true
# turn on search result highlighting of GridFS content
#file-highlighting = true
# index GridFS files inserted into the following collections
#file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
cluster-name = 'es-yjd'
# do not exit after full-sync, rather continue tailing the oplog
#exit-after-direct-reads = false
[[mapping]]
namespace = "testdb.user_info"
index = "user_info"
#type = ""

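Note: The configuration above uses only a subset of the available parameters to achieve real-time synchronization. For more complex synchronization requirements, see the Monstache Configuration and Advanced documentation.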
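
To avoid typing credentials into the file by hand, the same settings can also be generated from environment variables. The sketch below is an illustration under stated assumptions, not part of the original setup. The function name write_monstache_config and the variables MONGO_URL, ES_URL, ES_USER, and ES_PASSWORD are hypothetical; the TOML keys are the ones used in the sample configuration above. Values containing a double quote or a backslash would need TOML escaping, which this sketch does not handle.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal Monstache config from environment variables.
# Assumed inputs: MONGO_URL, ES_URL, ES_USER, ES_PASSWORD (names are illustrative).
write_monstache_config() {
  local out="$1"
  cat > "$out" <<EOF
mongo-url = "${MONGO_URL}"
elasticsearch-urls = ["${ES_URL}"]
elasticsearch-user = "${ES_USER}"
elasticsearch-password = "${ES_PASSWORD}"
# copy the existing documents of this collection to Elasticsearch
direct-read-namespaces = ["testdb.user_info"]
# resume from the position saved by a previous run
resume = true

[[mapping]]
namespace = "testdb.user_info"
index = "user_info"
EOF
  # The file contains passwords, so restrict it to the owner.
  chmod 600 "$out"
}

# Example usage:
#   export MONGO_URL='mongodb://root:<your_mongodb_password>@IP:27017'
#   export ES_URL='http://IP:9200' ES_USER='elastic' ES_PASSWORD='<your_es_password>'
#   write_monstache_config config.toml
```

List-valued settings are written as TOML arrays and the mapping as an array of tables, matching how Monstache expects them.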

3. Run the task

./bin/monstache -f config.toml

Note: The -f option specifies the configuration file. Because verbose = true is set in the configuration, Monstache prints all debug logs, including traces of its requests to Elasticsearch.

IV. Verify the results

MongoDB:

Four test documents were inserted into MongoDB manually.

db.getCollection("user_info").find().count()

Elasticsearch:

GET /user_info/_count

The documents have been synchronized to Elasticsearch.
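
The same check can be scripted from the command line. The sketch below assumes that curl is installed and that the _count response is the usual JSON object with a numeric count field. The helper name es_count_from_json is hypothetical, and IP and the password remain placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an Elasticsearch _count response on stdin
# and print the numeric value of its "count" field.
es_count_from_json() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage (IP and password are placeholders):
#   es=$(curl -s -u "elastic:<your_es_password>" "http://IP:9200/user_info/_count" \
#        | es_count_from_json)
#   echo "Elasticsearch count: $es"
```

The printed number can then be compared with the count returned by the MongoDB query above.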

FAQ 1: Installing Monstache fails with Get "https://proxy.golang.org/github.com/!burnt!sushi/toml/@v/v1.0.0.mod": dial tcp 172.217.163.49:443: connect: connection refused

The default module proxy configured in go env (proxy.golang.org) is not reachable from this network, so GOPROXY must point to a proxy that is reachable.

Run

go env -w GOPROXY=https://goproxy.cn

and then run

go install

again to install Monstache.

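FAQ 2: Cloning the repository fails with fatal: unable to access 'https://github.com/rwynn/monstache.git/': TCP connection reset by peer

The likely cause is a slow network connection combined with a large repository.

Try increasing Git's HTTP buffer, for example to about 1 GB (1048576000 bytes) or about 3 GB (3194304000 bytes):

git config --global http.postBuffer 1048576000

Then clone the repository again.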
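
If the connection still drops, reducing the amount of data transferred may help. The following sketch is an alternative that the original article does not use: a shallow clone that fetches only the latest commit of the required branch. The helper name shallow_clone is hypothetical; --depth and --branch are standard git clone options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone only the most recent commit of a single branch.
shallow_clone() {
  local url="$1" branch="$2" dest="$3"
  git clone --depth 1 --branch "$branch" "$url" "$dest"
}

# Example usage:
#   shallow_clone https://github.com/rwynn/monstache.git rel6 monstache
```

Because the clone already checks out rel6, the separate git checkout step is not needed afterwards.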
