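I. Problems Solved by the Tencent Cloud NLP Service
Natural language processing (NLP) capability is an increasingly urgent need for enterprises. For example, an e-commerce site needs to infer product preferences from user reviews, and a financial company needs to analyze public opinion about its products. Building NLP capability in-house means hiring specialists, collecting or purchasing large corpora, and going through a long development cycle, and the final result often still falls short of expectations.
The Tencent Cloud NLP service deeply integrates Tencent's top in-house NLP technology and, backed by a Chinese corpus on the scale of hundreds of billions, provides 16 intelligent text processing capabilities, including lexical analysis. These capabilities work out of the box, with no servers to buy or operate, which saves enterprises a large investment of manpower and resources. Using the Tencent Cloud Serverless Cloud Function (SCF) service, this article walks through a simplified example of quickly building a lexical analysis service on the Tencent Cloud ecosystem.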
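II. Tencent Cloud NLP Lexical Analysis APIs
Tencent Cloud NLP offers two APIs related to lexical analysis: similar words and intelligent lexical analysis. Based on the lexical analysis API, this article describes how an e-commerce site can run word segmentation, part-of-speech tagging, and named entity recognition on the user reviews it collects, and so build a lexical analysis system.
The main functions of the lexical analysis API are listed below (for the API details, see https://cloud.tencent.com/document/product/271/35494):
- Word segmentation: splits a continuous sentence into a reasonable sequence of words
- Part-of-speech tagging: labels each word with its part of speech, which resolves word ambiguity and eases deeper semantic processing later
- Named entity recognition: identifies entities in a sentence, such as places, person names, and times, in preparation for recognizing relations between entities
The business scenario of this lexical analysis system is as follows:
1. The website's business system continuously collects user reviews, periodically produces a text file of reviews, and uploads it to a COS bucket;
2. The COS service automatically triggers the Tencent Cloud SCF service; the lexical analysis cloud function calls the NLP lexical analysis API and obtains the word segmentation, part-of-speech tagging, and named entity recognition results;
3. The lexical analysis cloud function sends the results to Kafka, where downstream services consume them and write them to MySQL, ES, or similar services for further processing.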
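The flow above can be sketched as a small loop with the NLP call and the Kafka producer passed in as plain callables. This is only an illustration under stated assumptions: `process_comments`, `fake_analyze`, and the `{"Text": ..., "Results": ...}` message shape are hypothetical stand-ins, and no real cloud service is called.

```python
def process_comments(lines, analyze, publish):
    """Run each non-empty comment through analyze() and pass the result to publish()."""
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines in the uploaded file
        publish({"Text": text, "Results": analyze(text)})
        count += 1
    return count


# Hypothetical stand-in for the NLP lexical analysis call
def fake_analyze(text):
    return {"PosTokens": [{"Word": text, "Pos": "n"}]}


sent = []  # stands in for the Kafka topic
n = process_comments(["店家发货快\n", "\n", "鞋已收到\n"], fake_analyze, sent.append)
print(n, len(sent))
```

Keeping the two external calls behind parameters makes the loop easy to exercise locally before wiring in the real clients.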
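III. Implementation Steps
The core of this system is the lexical analysis cloud function; Kafka and the downstream ES and MySQL services are assumed to have been created already.
1. Create the lexical analysis cloud function
The function does three things:
- Receives the COS trigger event and downloads the user review file it describes
- Calls the NLP lexical analysis API to process the text
- Sends the analysis results to Kafka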
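The first task depends on the shape of the COS trigger event. Below is a minimal sketch of extracting the bucket and object key from one record, assuming the event carries the short bucket name and a key prefixed with `/<appid>/<bucket name>/`, as the function code in this article expects. `parse_cos_record` and the sample record are hypothetical, trimmed to the fields used here.

```python
def parse_cos_record(record, app_id):
    """Return (full_bucket_name, object_key) for one COS trigger record."""
    short_name = record["cos"]["cosBucket"]["name"]
    # The full bucket name is "<name>-<appid>".
    bucket = "{}-{}".format(short_name, app_id)
    raw_key = record["cos"]["cosObject"]["key"]
    # The event key is prefixed with "/<appid>/<name>/"; strip it once.
    prefix = "/{}/{}/".format(app_id, short_name)
    key = raw_key[len(prefix):] if raw_key.startswith(prefix) else raw_key
    return bucket, key


# Hypothetical record, reduced to the fields read above
sample = {
    "cos": {
        "cosBucket": {"name": "user-comment"},
        "cosObject": {"key": "/1300312696/user-comment/2019/comments.txt"},
    }
}
print(parse_cos_record(sample, "1300312696"))
```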
The code of the lexical analysis cloud function is as follows:
# -*- coding: utf8 -*-
import io
import json
import logging
import os

from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.nlp.v20190408 import nlp_client, models
from qcloud_cos import CosConfig
from qcloud_cos import CosS3Client
from qcloud_cos import CosServiceError
from pykafka.client import KafkaClient

SECRET_ID = "xxxxxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxx"
APP_ID = "xxxxxxxxxxxx"
REGION = "ap-guangzhou"
ENDPOINT = "nlp.tencentcloudapi.com"

# Kafka settings; fill in according to your environment
TOPIC = "comment_seg"
CONSUMER_GROUP = "comment_seg_1"
KAFKA_ADDRESS = "192.168.0.4:9092,192.168.0.5:9092"
ZK_ADDRESS = "192.168.0.4:2181,192.168.0.5:2181"

kafka_client = KafkaClient(hosts=KAFKA_ADDRESS)
topic = kafka_client.topics[TOPIC]
logger = logging.getLogger()


def main_handler(event, context):
    try:
        # Create the NLP client
        cred = credential.Credential(SECRET_ID, SECRET_KEY)
        httpProfile = HttpProfile()
        httpProfile.endpoint = ENDPOINT
        clientProfile = ClientProfile()
        clientProfile.httpProfile = httpProfile
        client = nlp_client.NlpClient(cred, REGION, clientProfile)

        # Create the COS client
        config = CosConfig(Region=REGION, SecretId=SECRET_ID, SecretKey=SECRET_KEY, Token=None, Scheme="https")
        cos_client = CosS3Client(config)

        for record in event['Records']:
            # Derive the full bucket name and the object key from the trigger event
            bucket_name = record['cos']['cosBucket']['name']
            bucket = bucket_name + '-' + str(APP_ID)
            key = record['cos']['cosObject']['key']
            key = key.replace('/' + str(APP_ID) + '/' + bucket_name + '/', '', 1)
            logger.info("Key is " + key)
            logger.info("Get from [%s] to download file [%s]" % (bucket, key))

            # Download the file from COS into /tmp (the file name only, so nested keys work)
            download_path = '/tmp/{}'.format(os.path.basename(key))
            try:
                response = cos_client.get_object(Bucket=bucket, Key=key)
                response['Body'].get_stream_to_file(download_path)
            except CosServiceError as e:
                print(e.get_error_code())
                print(e.get_error_msg())
                print(e.get_resource_location())
                return "Fail"
            except Exception as e:
                print(e)
                print('Error getting object {} from bucket {}. '.format(key, bucket))
                raise
            logger.info("Download file [%s] Success" % key)

            # Read the file, one comment per line
            with io.open(download_path, encoding='utf-8') as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    logger.info("Line:[%s]" % line)
                    # Call the lexical analysis API; json.dumps escapes quotes in the comment
                    req = models.LexicalAnalysisRequest()
                    params = json.dumps({"Flag": 1, "Text": line})
                    req.from_json_string(params)
                    resp = client.LexicalAnalysis(req)
                    print(resp.to_json_string())
                    # Send the original text and the analysis result to Kafka
                    to_kafka(json.dumps({"Text": line, "Results": json.loads(resp.to_json_string())}))
        return "Success"
    except Exception as ex:
        print("======================")
        print(ex)
        return "Fail"


def to_kafka(msg):
    # pykafka expects the message as bytes
    with topic.get_sync_producer() as producer:
        producer.produce(msg.encode('utf-8'))
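User comments can contain double quotes or backslashes, which break a JSON string assembled by hand. Below is a minimal sketch, written for Python 3, of building both the request parameters and the Kafka message with `json.dumps` instead. `build_request_params` and `build_kafka_message` are hypothetical helper names, not part of any SDK, and the result string passed in is a made-up placeholder.

```python
import json


def build_request_params(text, flag=1):
    """Serialize the LexicalAnalysis parameters; json.dumps escapes quotes and newlines."""
    return json.dumps({"Flag": flag, "Text": text})


def build_kafka_message(text, result_json):
    """Combine the original comment and the analysis result into one UTF-8 JSON message."""
    payload = {"Text": text, "Results": json.loads(result_json)}
    return json.dumps(payload).encode("utf-8")


# A comment containing double quotes, which naive concatenation would corrupt
comment = '客服说"明天发货",结果没发'
params = build_request_params(comment)
message = build_kafka_message(comment, '{"PosTokens": [], "RequestId": "demo"}')
print(json.loads(params)["Text"] == comment)
print(json.loads(message.decode("utf-8"))["Results"]["RequestId"])
```

Because both strings round-trip through `json.loads`, a downstream consumer can parse the message without special handling for punctuation in the comment.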
The deployment configuration file for the cloud function is as follows:
Resources:
  default:
    Type: TencentCloud::Serverless::Namespace
    lexical-demo:
      Type: TencentCloud::Serverless::Function
      Properties:
        CodeUri: ./
        Type: Event
        Description: This is a template function
        Role: SCF_QcsRole
        Environment:
          Variables:
            ENV_FIRST: env1
            ENV_SECOND: env2
        Handler: index.main_handler
        MemorySize: 128
        Runtime: Python2.7
        Timeout: 3
        #VpcConfig:
        #    VpcId: 'vpc-qdqc5k2p'
        #    SubnetId: 'subnet-pad6l61i'
        #Events:
        #    timer:
        #        Type: Timer
        #        Properties:
        #            CronExpression: '*/5 * * * *'
        #            Enable: True
        #    cli-appid.cos.ap-beijing.myzijiebao.com: # full bucket name
        #        Type: COS
        #        Properties:
        #            Bucket: cli-appid.cos.ap-beijing.myzijiebao.com
        #            Filter:
        #                Prefix: filterdir/
        #                Suffix: .jpg
        #            Events: cos:ObjectCreated:*
        #            Enable: True
        #    topic: # topic name
        #        Type: CMQ
        #        Properties:
        #            Name: qname
        #    hello_world_apigw: # ${FunctionName} + '_apigw'
        #        Type: APIGW
        #        Properties:
        #            StageName: release
        #            ServiceId:
        #            HttpMethod: ANY

Globals:
  Function:
    Timeout: 10
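Deploy locally with the SCF CLI:
scf deploy -f --cos-bucket temp-code-1300312696
The function deploys successfully.
2. Configure the trigger for the lexical analysis cloud function
On the "Trigger Management" page of the lexical analysis cloud function, configure the bucket that stores the user review files and the event type, then click Submit.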
IV. Results
Each file uploaded to the COS bucket contains one review per line. Sample content:
店家发货送了双白色袜子,穿起来好舒服
鞋已收到试穿了下,还挺合适,明天去球场上验证下战靴,看下实战怎么样
有点硬邦邦的,第一次买球鞋,感觉还不错
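When a file is uploaded to the user-comment bucket, the lexical analysis cloud function is triggered automatically, and the invocation records can be viewed through the cloud function's log query feature. A sample lexical analysis result is shown below: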
{
    "NerTokens": null,
    "PosTokens": [
        {"Length": 2, "Word": "店家", "BeginOffset": 0, "Pos": "n"},
        {"Length": 2, "Word": "发货", "BeginOffset": 2, "Pos": "v"},
        {"Length": 1, "Word": "送", "BeginOffset": 4, "Pos": "v"},
        {"Length": 1, "Word": "了", "BeginOffset": 5, "Pos": "u"},
        {"Length": 1, "Word": "双", "BeginOffset": 6, "Pos": "m"},
        {"Length": 2, "Word": "白色", "BeginOffset": 7, "Pos": "n"},
        {"Length": 2, "Word": "袜子", "BeginOffset": 9, "Pos": "n"},
        {"Length": 1, "Word": ",", "BeginOffset": 11, "Pos": "w"},
        {"Length": 1, "Word": "穿", "BeginOffset": 12, "Pos": "v"},
        {"Length": 2, "Word": "起来", "BeginOffset": 13, "Pos": "v"},
        {"Length": 1, "Word": "好", "BeginOffset": 15, "Pos": "a"},
        {"Length": 3, "Word": "好舒服", "BeginOffset": 15, "Pos": "a"},
        {"Length": 2, "Word": "舒服", "BeginOffset": 16, "Pos": "a"}
    ],
    "RequestId": "5597cfb6-64f5-42d0-8727-866c400d9778"
}
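A downstream consumer would typically pull specific word classes out of this result, for example product nouns and descriptive adjectives. Below is a minimal sketch, assuming the `Pos` tags `n` and `a` denote nouns and adjectives as the sample suggests. `words_by_pos` is a hypothetical helper, and the input is a trimmed copy of the result above.

```python
import json


def words_by_pos(result_json, wanted=("n", "a")):
    """Group the words in PosTokens by part-of-speech tag, keeping only the wanted tags."""
    result = json.loads(result_json)
    grouped = {}
    # PosTokens may be null in a response, so fall back to an empty list
    for token in result.get("PosTokens") or []:
        if token["Pos"] in wanted:
            grouped.setdefault(token["Pos"], []).append(token["Word"])
    return grouped


# Trimmed copy of the sample result
sample = json.dumps({
    "NerTokens": None,
    "PosTokens": [
        {"Length": 2, "Word": "店家", "BeginOffset": 0, "Pos": "n"},
        {"Length": 2, "Word": "发货", "BeginOffset": 2, "Pos": "v"},
        {"Length": 2, "Word": "袜子", "BeginOffset": 9, "Pos": "n"},
        {"Length": 2, "Word": "舒服", "BeginOffset": 16, "Pos": "a"},
    ],
})
print(words_by_pos(sample))
```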
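V. Summary
This article showed how to quickly build a lexical analysis system on the Tencent Cloud ecosystem. An enterprise can put such a system together in a short time without hiring NLP specialists. Combined with other capabilities of the NLP service, such as text classification and sentiment analysis, it can be extended into a richer set of semantic analysis capabilities, helping enterprises make the leap from data to business insight.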