Automated Fault-Stability Testing in a Private Cloud

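1. Introduction

1.1 Why test fault stability?

I wrote this article to record the past month's work and to share some of what I have learned over ten years in the field. Let's get started.

In recent years we have regularly seen headlines such as "12306 is down", "Alipay cannot transfer money" or "WeChat is unavailable". Even applications that people depend on every day run into trouble. These outages are usually not caused by attackers. As computer systems grow more complex, failures become harder to control. Engineers have devised many safeguards against that risk, such as disaster recovery, backups, high-availability clusters and off-site backups, but no set of safeguards covers every case. Testers therefore need to simulate as many failure scenarios as possible so that engineers get early warning of the risks.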

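1.2 Automating fault-stability testing

Traditionally, fault-stability testing is done by hand. You inject normal traffic into a test environment and then manually stop a component or service, or power off a physical machine, so that the service becomes unavailable. In a highly available architecture the system should fail over to its standby immediately and keep the impact of the fault to a minimum. During this process the tester measures how long the service takes to recover and how stable the cluster is once it has recovered. Done by hand, this takes a great deal of time, some scenarios cannot be reproduced manually, and the data cannot be captured accurately. That is why the fault-stability tests need to be automated.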
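To make the idea of measuring recovery time concrete, here is a minimal sketch that polls a health probe after a fault and reports how long the service took to come back. The function name, the probe callable and the default timeout and interval are all assumptions made for illustration; they are not part of rally.

```python
import time
from typing import Callable, Optional


def measure_recovery(probe: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 0.5,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll `probe` until it returns True.

    Returns the seconds elapsed since polling started, or None if the
    service did not recover within `timeout` seconds.
    """
    start = clock()
    while clock() - start < timeout:
        if probe():
            return clock() - start
        sleep(interval)
    return None
```

The clock and sleep functions are injectable so that the helper can be exercised without waiting in real time. In practice `probe` would be something like an HTTP request against the service under test.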

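1.3 Why xrally as the framework for fault-stability testing

xrally is the newer incarnation of rally, the OpenStack performance-testing project. In its current version, xrally supports cloud environments such as OpenStack, Docker and Kubernetes through plugins. The OpenStack-specific code has been split out of rally itself and now lives in rally-openstack.

https://github.com/openstack/rally.git (the core rally framework)

https://github.com/openstack/rally-openstack.git (the OpenStack plugins)

The rest of this article works through xrally step by step, from basic usage to developing xrally plugins for your own project.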

2. Basic rally usage

2.1 Installing rally

2.1.1 Installing with the install script

wget -q -O- https://raw.githubusercontent.com/openstack/rally/master/install_rally.sh | bash

or using curl

curl https://raw.githubusercontent.com/openstack/rally/master/install_rally.sh | bash

2.1.2 Installing in a container

Build a Docker image for rally. We build our own image because we need to add some extra Python libraries and our own code.


# Bake every tool we need into the image
FROM ubuntu:16.04
RUN sed -i s/^deb-src.*// /etc/apt/sources.list
RUN apt-get update && apt-get install --yes sudo python python-pip vim git-core && \
    pip install --upgrade pip && \
    useradd -u 65500 -m rally && \
    usermod -aG sudo rally && \
    echo "rally ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/00-rally-user

COPY ./source /home/rally/source
COPY motd /etc/motd
WORKDIR /home/rally/source

# ensure that we have all system packages installed
RUN pip install bindep && apt-get install --yes $(bindep -b | tr '\n' ' ')

# install rally from source, then the extra libraries used for fault injection
RUN pip install -i https://pypi.douban.com/simple . --constraint upper-constraints.txt \
        --ignore-installed PyYAML && \
    pip install -i https://pypi.douban.com/simple pymysql --ignore-installed PyYAML && \
    pip install -i https://pypi.douban.com/simple psycopg2 --ignore-installed PyYAML && \
    pip install os-faults && \
    pip install rally-runners && \
    mkdir /etc/rally && \
    echo "[database]" > /etc/rally/rally.conf && \
    echo "connection=sqlite:////home/rally/data/rally.db" >> /etc/rally/rally.conf

RUN echo '[ ! -z "$TERM" -a -r /etc/motd ] && cat /etc/motd' >> /etc/bash.bashrc

# Cleanup pip
RUN rm -rf /root/.cache/

USER root
ENV HOME /home/rally
RUN mkdir -p /home/rally/data && rally db recreate

# os-faults drives the cluster through Ansible
RUN apt-get -y install software-properties-common && \
    apt-add-repository ppa:ansible/ansible && \
    apt-get update && \
    apt-get install -y ansible

# Docker volumes have specific behavior that allows this construction to work.
# Data generated during the image creation is copied to volume only when it's
# attached for the first time (volume initialization)
VOLUME ["/home/rally/data"]

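2.2 rally architecture

[Figure: overall Rally architecture]

[Figure: Rally plugin architecture]

In the rally plugin framework, a custom task scenario, runner, SLA, deployment or context is each implemented as a plugin.

What each component does:

scenario: defines, in code, the scenario to run

runner: defines how long the scenario runs, the concurrency, the number of iterations and so on

SLA: defines the criteria the test results must meet

deployment: selects the environment under test

context: sets up the environment before the run and cleans it up afterwards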

2.3 rally features

2.3.1 Creating an environment and running a task

Here we use OpenStack as the underlying environment; Kubernetes would work as well.

Install the rally-openstack package:

pip install rally-openstack

Initialize the rally database:

rally db create

Import an OpenStack environment into rally as a deployment:

rally deployment create --name=openstack --file=/home/rally/data/admin-openrc.json

Format of admin-openrc.json:

{
  "openstack": {
    "auth_url": "http://xx.xxx.xxx.x:xxxx/v3",
    "region_name": "RegionOne",
    "endpoint_type": "public",
    "admin": {
      "username": "admin",
      "password": "passw0rd",
      "tenant_name": "admin",
      "project_name": "admin",
      "project_domain_name": "default"
    },
    "https_insecure": false,
    "https_cacert": ""
  }
}
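If the deployment file has to be produced for several environments, it can be generated rather than edited by hand. The sketch below builds a dict with the same keys as the admin-openrc.json file described in this section. The helper name and its parameters are hypothetical, and the default region and domain are simply the values used in this article.

```python
import json


def build_deployment(auth_url, username, password, project,
                     region="RegionOne", domain="default"):
    """Return a dict laid out like the admin-openrc.json deployment file."""
    return {
        "openstack": {
            "auth_url": auth_url,
            "region_name": region,
            "endpoint_type": "public",
            "admin": {
                "username": username,
                "password": password,
                "tenant_name": project,
                "project_name": project,
                "project_domain_name": domain,
            },
            "https_insecure": False,
            "https_cacert": "",
        }
    }


def write_deployment(path, spec):
    """Serialize the deployment dict to `path` as indented JSON."""
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2)
```

The resulting file can then be passed to rally deployment create through --file.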

Run a task. Test cases are written in YAML. How to write them is covered in detail later, so here is just a quick example.


---
{% set repeat = repeat|default(3) %}
HttpRequests.ListInstance:
{% for iteration in range(repeat) %}
- args:
    url: http://xx.xxx.xx.xxx/api/resource/listVirtualMachine
    method: POST
    status_code: 200
    json:
      pageSize: 1
      pageNumber: 1
    headers:
      Content-Type: application/json
      charset: UTF-8
  runner:
    type: constant_for_duration
    duration: 60
    concurrency: 1
  context:
    users:
      tenants: 1
      users_per_tenant: 1
  sla:
    failure_rate:
      max: 0
  hooks:
  - name: fault_injection@openstack
    args:
      action: kill mysql service on one node
    trigger:
      name: event
      args:
        unit: iteration
        at:
        - 50
{% endfor %}

Run the task:

rally --plugin-paths /tmp/pycharm_project_371/rally-archos/scenarios/ListInstance.py task start --task /home/rally/data/List_Instance.yaml --deployment 4c7157d1-757e-488c-8ee4-fc976b80af58

In this command, --plugin-paths points to the code that implements the test case, --task points to the test case file, and --deployment selects the test environment.
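When the same task is started repeatedly from a script, it can help to assemble the command line in one place. This is a minimal sketch with a hypothetical helper name. It only builds the argument list for the three options explained above, and it assumes that several plugin paths may be joined with commas.

```python
import shlex


def rally_task_start(task_file, deployment, plugin_paths=()):
    """Build the argv list for `rally task start`."""
    argv = ["rally"]
    if plugin_paths:
        # assumption: multiple plugin locations are separated by commas
        argv += ["--plugin-paths", ",".join(plugin_paths)]
    argv += ["task", "start", "--task", task_file, "--deployment", deployment]
    return argv


def as_shell(argv):
    """Render the argv list as a shell-safe string, e.g. for logging."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The list returned by rally_task_start can be handed directly to subprocess.run.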

Export the run report:

rally task report 8176ce9a-6c3a-473a-bb25-ac0279b04bdd --out output.html

Adjust the report. The JavaScript and CSS that the original report loads are blocked from within China, so replace them with mirrors hosted in China:

<link rel="stylesheet" href="http://cdn.bootcss.com/nvd3/1.1.15-beta/nv.d3.css">
<script type="text/javascript" src="http://cdn.bootcss.com/angular.js/1.3.3/angular.min.js"></script>
<script type="text/javascript" src="http://cdn.bootcss.com/d3/3.4.13/d3.min.js"></script>
<script type="text/javascript" src="http://cdn.bootcss.com/nvd3/1.1.15-beta/nv.d3.min.js"></script>

3. rally plugin development

To create a plugin, first import the decorator rally.task.scenario.configure and set the test case name in @scenario.configure(name="my_new_plugin_name"). The same name is used in the YAML file, so the two must match exactly.

from rally.task import scenario


@scenario.configure(name="my_new_plugin_name")
class MyNewPlugin(scenario.Scenario):
    pass

3.1 context as a plugin

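A context plugin runs initialization before the scenario iterations start. For example, if a scenario creates virtual machines and the image has to be downloaded beforehand, the download belongs in a context plugin. A context performs its setup before the scenario loop begins and its cleanup after the loop ends.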


from rally import consts
from rally.common import logging
from rally.plugins.openstack import osclients
from rally.task import context

LOG = logging.getLogger(__name__)


@context.configure(name="create_flavor", order=1000)
class CreateFlavorContext(context.Context):
    """This sample creates a flavor with specified option."""

    CONFIG_SCHEMA = {
        "type": "object",
        "$schema": consts.JSON_SCHEMA,
        "additionalProperties": False,
        "properties": {
            "flavor_name": {
                "type": "string",
            },
            "ram": {
                "type": "integer",
                "minimum": 1
            },
            "vcpus": {
                "type": "integer",
                "minimum": 1
            },
            "disk": {
                "type": "integer",
                "minimum": 1
            }
        }
    }

    def setup(self):
        """This method is called before the task starts."""
        try:
            # use rally.osclients to get necessary client instance
            nova = osclients.Clients(self.context["admin"]["credential"]).nova()
            # and than do what you need with this client
            self.context["flavor"] = nova.flavors.create(
                # context settings are stored in self.config
                name=self.config.get("flavor_name", "rally_test_flavor"),
                ram=self.config.get("ram", 1),
                vcpus=self.config.get("vcpus", 1),
                disk=self.config.get("disk", 1)).to_dict()
            LOG.debug("Flavor with id '%s'" % self.context["flavor"]["id"])
        except Exception as e:
            msg = "Can't create flavor: %s" % e
            if logging.is_debug():
                LOG.exception(msg)
            else:
                LOG.warning(msg)

    def cleanup(self):
        """This method is called after the task finishes."""
        try:
            nova = osclients.Clients(self.context["admin"]["credential"]).nova()
            nova.flavors.delete(self.context["flavor"]["id"])
            LOG.debug("Flavor '%s' deleted" % self.context["flavor"]["id"])
        except Exception as e:
            msg = "Can't delete flavor: %s" % e
            if logging.is_debug():
                LOG.exception(msg)
            else:
                LOG.warning(msg)

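Let's walk through the official sample in detail:

@context.configure(name="create_flavor", order=1000) -----> registers the context under this name

class CreateFlavorContext(context.Context) -----> inherits from the parent class context.Context

def setup(self) -----> initialization

def cleanup(self) -----> cleanup

self.context -----> the dict in which all data must be stored; it serves as the return value

LOG = logging.getLogger(__name__) -----> the logger used to write log messages throughout rally

CONFIG_SCHEMA -----> the JSON schema that describes the format of the context's configuration data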
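The order in which setup, the scenario iterations and cleanup run can be illustrated without rally at all. The sketch below is a plain-Python stand-in: RecordingContext and run_with_context are hypothetical names and do not use rally's classes. It only shows that data placed in the shared dict during setup is visible to every iteration and that cleanup runs last.

```python
class RecordingContext:
    """Hypothetical stand-in for a context plugin; it records the call order."""

    def __init__(self):
        self.context = {}   # shared dict, playing the role of self.context
        self.events = []

    def setup(self):
        # whatever setup creates is stored in the shared dict
        self.context["flavor"] = {"id": "demo-flavor"}
        self.events.append("setup")

    def cleanup(self):
        self.events.append("cleanup")


def run_with_context(ctx, scenario, iterations):
    """Run setup once, the scenario `iterations` times, then cleanup once."""
    ctx.setup()
    try:
        for i in range(iterations):
            scenario(ctx.context, i)
    finally:
        # in this sketch, cleanup runs even if an iteration raises
        ctx.cleanup()
    return ctx.events
```

Calling run_with_context with a scenario that reads context["flavor"]["id"] shows the pattern used by the create_flavor sample: setup publishes the flavor, the iterations consume it, and cleanup removes it.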

3.2 hooks as a plugin

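3.2.1 What are hooks?

We need to test how configuration changes and restarts of basic components affect overall performance and stability. In rally, hooks can simulate the vast majority of faults. Because we need to simulate operations such as shutting down and rebooting machines, we use a third-party library.

os-faults simulates system faults; it controls the cluster through Ansible scripts.

3.2.2 How to use hooks

Here we call the os-faults human API to kill the mysql process. Everything referred to in action must be configured in os-faults.yaml, and the path to that file must be exported as an environment variable.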


  hooks:
  - name: fault_injection@openstack
    args:
      action: kill mysql service on one node
    trigger:
      name: event
      args:
        unit: iteration
        at:
        - 5

export OS_FAULTS_CONFIG=os-faults.yaml


cloud_management:
  driver: universal

node_discover:
  driver: node_list
  args:
  - ip: 10.xxx.xx.14
    auth:
      username: root
      password: *********
      private_key_file: cloud_key
  - ip: 10.xxx.xx.15
    auth:
      username: root
      password: *********
      private_key_file: cloud_key
  - ip: 10.xxx.xx.16
    auth:
      username: root
      password: *********
      private_key_file: cloud_key

power_managements:
- driver: ipmi
  args:
    mac_to_bmc:
      EC:38:8F:7D:40:FF:
        address: 172.xx.xx.118
        username: root
        password: *********
      E8:4D:D0:B3:CC:3D:
        address: 172.xx.xx.119
        username: root
        password: *********
      E8:4D:D0:B3:CA:D0:
        address: 172.xx.xx.120
        username: root
        password: *********

services:
  nova-api:
    driver: system_service
    args:
      service_name: nova-api
      grep: nova-api
  glance-api:
    driver: system_service
    args:
      service_name: glance-api
      grep: glance-api
  identity:
    driver: system_service
    args:
      service_name: identity
      grep: identity
  memcached:
    driver: system_service
    args:
      service_name: memcached
      grep: memcached
  mysql:                 # name of the service
    driver: process      # name of the service driver
    args:                # arguments for the driver
      grep: mysqld
      port:
      - tcp
      - 3306
      restart_cmd: sudo systemctl restart mariadb
      start_cmd: sudo systemctl start mariadb
      terminate_cmd: sudo systemctl stop mariadb
  rabbitmq-server:
    driver: system_service
    args:
      service_name: rabbitmq-server
      grep: rabbitmq-server
    hosts:
    - 172.16.170.13
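The action string used by the hook, such as "kill mysql service on one node", reads like a sentence made of an action, a service name and an optional node selector, and the service name has to match a key under services in os-faults.yaml. The sketch below is a simplified illustration of that structure. It is not the parser that os-faults itself uses, and the function names are hypothetical.

```python
import re

# simplified pattern: "<action> <service> service [on <selector> node(s)]"
_COMMAND = re.compile(
    r"^(?P<action>\w+)\s+(?P<service>[\w-]+)\s+service"
    r"(?:\s+on\s+(?P<node>\S+)\s+nodes?)?$"
)


def parse_service_command(command):
    """Split a service command into action, service and node selector."""
    match = _COMMAND.match(command.strip())
    if match is None:
        raise ValueError("not a service command: %r" % command)
    return match.groupdict()


def check_service_known(command, configured_services):
    """Return True if the service named in the command is configured."""
    return parse_service_command(command)["service"] in configured_services
```

A check like check_service_known can catch a misspelled service name before a long test run starts, for example by passing it the keys of the services section loaded from os-faults.yaml.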

3.3 scenario runner as a plugin

3.3.1 What is a scenario runner?

A scenario runner defines the pattern in which a scenario is executed repeatedly.

rally ships with four execution modes, so you only need to write your own runner in special cases.
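To show what an execution mode amounts to, here is a plain-Python sketch of two simple strategies: running a scenario a fixed number of times with a given concurrency, and running it repeatedly until a time budget is used up, in the spirit of the constant_for_duration runner used in section 2.3.1. Both functions are hypothetical illustrations and do not reproduce rally's runner implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_constant(scenario, times, concurrency):
    """Call scenario(i) for i in range(times) on `concurrency` threads.

    Results are returned in iteration order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(scenario, range(times)))


def run_for_duration(scenario, duration, clock=time.monotonic):
    """Call scenario(i) serially until `duration` seconds have elapsed."""
    results = []
    start = clock()
    iteration = 0
    while clock() - start < duration:
        results.append(scenario(iteration))
        iteration += 1
    return results
```

The clock is injectable so that the duration-based variant can be tried out with a fake clock instead of real waiting.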

3.3.2 Scenario runner template


import random

from rally import consts
from rally.task import runner


@runner.configure(name="random_times")
class RandomTimesScenarioRunner(runner.ScenarioRunner):
    """Sample scenario runner plugin.

    Run scenario random number of times (between min_times and max_times)
    """

    CONFIG_SCHEMA = {
        "type": "object",
        "$schema": consts.JSON_SCHEMA,
        "properties": {
            "type": {
                "type": "string"
            },
            "min_times": {
                "type": "integer",
                "minimum": 1
            },
            "max_times": {
                "type": "integer",
                "minimum": 1
            }
        },
        "additionalProperties": True
    }

    def _run_scenario(self, cls, method_name, context, args):
        # runners settings are stored in self.config
        min_times = self.config.get('min_times', 1)
        max_times = self.config.get('max_times', 1)

        # randint is inclusive, so min_times == max_times is also valid
        for i in range(random.randint(min_times, max_times)):
            run_args = (i, cls, method_name,
                        runner._get_scenario_context(context), args)
            result = runner._run_scenario_once(run_args)
            # use self.send_result for result of each iteration
            self._send_result(result)

3.4 scenario as a plugin

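3.4.1 What is a scenario?

The plugins described so far, such as context and scenario runner, all exist to serve the scenario. The scenario is executed repeatedly to verify the performance and stability of the system.

3.4.2 Scenario template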


from rally import consts
from rally.plugins.openstack import scenario
from rally.task import atomic
from rally.task import validation


@validation.add("required_services", services=[consts.Service.NOVA])
@validation.add("required_platform", platform="openstack", users=True)
@scenario.configure(name="ScenarioPlugin.list_flavors_useless")
class ListFlavors(scenario.OpenStackScenario):
    """Sample plugin which lists flavors."""

    @atomic.action_timer("list_flavors")
    def _list_flavors(self):
        """Sample of usage clients - list flavors

        You can use self.context, self.admin_clients and self.clients
        which are initialized on scenario instance creation"""
        self.clients("nova").flavors.list()

    @atomic.action_timer("list_flavors_as_admin")
    def _list_flavors_as_admin(self):
        """The same with admin clients"""
        self.admin_clients("nova").flavors.list()

    def run(self):
        """List flavors."""
        self._list_flavors()
        self._list_flavors_as_admin()

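@scenario.configure(name="ScenarioPlugin.list_flavors_useless") -----> registers the scenario under this name

class ListFlavors(scenario.OpenStackScenario) -----> the parent class you want to inherit from

def run(self) -----> the entry point that gets executed

@atomic.action_timer("list_flavors") -----> marks a sub-task of the scenario that is timed separately; it can use self.context, self.admin_clients and self.clients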
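The idea behind an atomic action timer can be sketched as an ordinary decorator that records how long each named sub-step takes. The code below is a stand-alone illustration with hypothetical names. It is not rally's implementation, and it stores the timings in a simple list on the instance.

```python
import functools
import time


def action_timer(name):
    """Decorator that appends (name, seconds) to self.atomic_actions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            start = time.perf_counter()
            try:
                return func(self, *args, **kwargs)
            finally:
                # record the duration even if the step raises
                elapsed = time.perf_counter() - start
                self.atomic_actions.append((name, elapsed))
        return wrapper
    return decorator


class DemoScenario:
    """Hypothetical scenario with one timed sub-step."""

    def __init__(self):
        self.atomic_actions = []

    @action_timer("list_flavors")
    def _list_flavors(self):
        return ["m1.tiny", "m1.small"]

    def run(self):
        return self._list_flavors()
```

After DemoScenario().run(), the atomic_actions list holds one entry per timed step, which is the kind of per-step data a report can break down.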

3.5 SLA as a plugin

Omitted.
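For orientation only, the failure_rate criterion with max: 0 used in the task of section 2.3.1 amounts to a check like the one below. This is a plain-Python sketch with a hypothetical function name, assuming that the limit is expressed as a percentage of failed iterations. It is not an SLA plugin.

```python
def failure_rate_ok(iteration_failed, max_percent=0.0):
    """Check a list of per-iteration failure flags against a limit.

    iteration_failed: list of booleans, True when the iteration failed.
    max_percent: highest acceptable share of failed iterations, in percent.
    """
    if not iteration_failed:
        return True  # nothing ran, so nothing failed
    rate = 100.0 * sum(iteration_failed) / len(iteration_failed)
    return rate <= max_percent
```

With max_percent=0, a single failed iteration is enough to fail the check, which is the strict setting used in this article's task file.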

3.6 VerificationReporter as a plugin

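3.6.1 What is a VerificationReporter?

A VerificationReporter defines the output format of the final report. rally currently provides html, json and junit-xml reports, and you can also define your own report format.

3.6.2 VerificationReporter template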


import json

from rally.verification import reporter


@reporter.configure("summary-in-json")
class SummaryInJsonReporter(reporter.VerificationReporter):
    """Store summary of verification(s) in JSON format"""

    # ISO 8601
    TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

    @classmethod
    def validate(cls, output_destination):
        # we do not have any restrictions for destination, so nothing to
        # check
        pass

    def generate(self):
        report = {}

        for v in self.verifications:
            report[v.uuid] = {
                "started_at": v.created_at.strftime(self.TIME_FORMAT),
                "finished_at": v.updated_at.strftime(self.TIME_FORMAT),
                "status": v.status,
                "run_args": v.run_args,
                "tests_count": v.tests_count,
                "tests_duration": v.tests_duration,
                "skipped": v.skipped,
                "success": v.success,
                "expected_failures": v.expected_failures,
                "unexpected_success": v.unexpected_success,
                "failures": v.failures,
                # v.tests includes all information about launched tests,
                # but for simplification of this fake reporters, let's
                # save just names
                "launched_tests": [test["name"]
                                   for test in v.tests.values()]
            }

        raw_report = json.dumps(report, indent=4)

        if self.output_destination:
            # In case of output_destination existence report will be saved
            # to hard drive and there is nothing to print to stdout, so
            # "print" key is not used
            return {"files": {self.output_destination: raw_report},
                    "open": self.output_destination}
        else:
            # it is something that will be print at CLI layer.
            return {"print": raw_report}

3.7 VerifierManager as a plugin

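3.7.1 What is a VerifierManager?

Verification checks that the system is complete and functional.

For example, Tempest can be loaded to verify OpenStack functionality.

To bring in a third-party tool, we need a VerifierManager.

3.7.2 VerifierManager template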


import random
import re

from rally.verification import manager


# Verification component expects that method "run" of verifier returns
# object. Class Result is a simple wrapper for two expected properties.
class Result(object):
    def __init__(self, totals, tests):
        self.totals = totals
        self.tests = tests


@manager.configure("fake-tool", default_repo="https://example.com")
class FakeTool(manager.VerifierManager):
    """Fake Tool o/"""

    TESTS = ["fake_tool.tests.bar.FatalityTestCase.test_one",
             "fake_tool.tests.bar.FatalityTestCase.test_two",
             "fake_tool.tests.bar.FatalityTestCase.test_three",
             "fake_tool.tests.bar.FatalityTestCase.test_four",
             "fake_tool.tests.foo.MegaTestCase.test_one",
             "fake_tool.tests.foo.MegaTestCase.test_two",
             "fake_tool.tests.foo.MegaTestCase.test_three",
             "fake_tool.tests.foo.MegaTestCase.test_four"]

    # This fake verifier doesn't launch anything, just returns random
    #  results, so let's override parent methods to avoid redundant
    #  clonning repo, checking packages and so on.

    def install(self):
        pass

    def uninstall(self, full=False):
        pass

    # Each tool, which supports configuration, has the own mechanism
    # for that task. Writing unified method is impossible. That is why
    # `VerificationManager` implements the case when the tool doesn't
    # need (doesn't support) configuration at all. Such behaviour is
    # ideal for FakeTool, since we do not need to change anything :)

    # Let's implement method `run` to return random data.
    def run(self, context):
        totals = {"tests_count": len(self.TESTS),
                  "tests_duration": 0,
                  "failures": 0,
                  "skipped": 0,
                  "success": 0,
                  "unexpected_success": 0,
                  "expected_failures": 0}
        tests = {}
        for name in self.TESTS:
            duration = random.randint(0, 10000)/100.
            totals["tests_duration"] += duration
            test = {"name": name,
                    "status": random.choice(["success", "fail"]),
                    "duration": "%s" % duration}
            if test["status"] == "fail":
                test["traceback"] = "Ooooppps"
                totals["failures"] += 1
            else:
                totals["success"] += 1
            tests[name] = test
        return Result(totals, tests=tests)

    def list_tests(self, pattern=""):
        return [name for name in self.TESTS if re.match(pattern, name)]

4. Generating reports with rally

4.1 Generating an HTML report

rally task report 484204c2-505c-490e-b55f-c52949802333 --out output.html

4.2 Generating JSON data

rally task report 484204c2-505c-490e-b55f-c52949802333 --json --out output.json

4.3 Generating a trends chart

rally task trends --tasks d2aadbb1-0faf-48ee-9c2f-4184ed83eabe --out output-trends.html

4.4 Generating a fault analysis chart

1). Generate results

rally task results 065626f2-b13d-4400-94b2-dea6644e3858 > output.json

2). Generate index.rst with the report function in the rally-runners scripts

The stock script no longer matches the current version, so I modified part of the code and call make_report to generate the rst file.

3). Convert the rst to HTML with Sphinx

pip install sphinx

sphinx-build -b html . _build

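5. Analyzing the fault test report

First, a few terms:

Recovery period - the period after a fault during which service performance is degraded

MTTR - the mean time to recover service performance after a fault

Service Downtime - the time during which the service is unavailable

Absolute performance degradation - the absolute difference between the mean operation duration during the recovery period and the baseline

Relative performance degradation - the ratio between the mean operation duration during the recovery period and the baseline

Fault injection - a function that simulates a software or hardware fault

Service hang - a fault that simulates a hung service by sending the POSIX signals SIGSTOP and SIGCONT to the service process

Service crash - a fault that simulates abnormal program termination by sending SIGKILL to the service process

Node crash - a fault that simulates an unexpected hardware power loss

Network partition - a fault that causes loss of connectivity between service components running on different hardware nodes; used to push HA services into a split-brain state

Network flapping - a fault that simulates a network interface going down on a hardware node or switch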
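The two degradation metrics and MTTR follow directly from their definitions. The sketch below computes them from lists of operation durations and recovery times. The function names and the input layout are assumptions made for illustration; they do not describe the format of the generated report.

```python
from statistics import mean


def degradation(baseline_durations, recovery_durations):
    """Compare mean operation duration in the recovery period to the baseline.

    Returns the absolute difference and the ratio of the two means.
    """
    base = mean(baseline_durations)
    recovery = mean(recovery_durations)
    return {
        "absolute": abs(recovery - base),
        "relative": recovery / base,
    }


def mttr(recovery_times):
    """Mean time to recover over several fault injections, in seconds."""
    return mean(recovery_times)
```

For example, with a baseline of 1.0 s per operation and recovery-period operations of 2.0 s and 4.0 s, the recovery mean is 3.0 s, so the absolute degradation is 2.0 s and the relative degradation is 3.0.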
