Detecting SQL Injection with Pixie

Authors: Elaine Laguerta, Hannah Stepanek, Robert Prast

This is a guest post by employees of New Relic.

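Think about the possible security vulnerabilities in the code you ship. What keeps you up at night? You would be wise to answer "SQL injection": after all, since 2003 it has been on the OWASP (Open Web Application Security Project) Top 10 list of CVEs[1] (Common Vulnerabilities and Exposures).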

Suppose you have an endpoint that accepts an id query parameter:

http://example.com/api/host?id=1

It turns out the id parameter is not properly sanitized, so it is injectable. An attacker could come along and do something like this:

http://example.com/api/host?id=1 UNION SELECT current_user, NULL FROM user

At the database level, this runs the following query:

SELECT name, ip FROM host WHERE id=1 UNION SELECT current_user, NULL FROM user

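Now, instead of returning the host data that belongs to that id, the endpoint returns the database username. As you can imagine, this is very bad: the attacker has just exposed a huge vulnerability in this endpoint. Anyone who exploits it can pull any data they want out of the database.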
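To make the mechanism concrete, here is a minimal sketch of how such an endpoint might build its query by string concatenation. It is an illustration under stated assumptions, not the code of any real service: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and because SQLite has no current_user, a hypothetical account table plays the role of the leaked data.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a host table and a hypothetical account table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT, ip TEXT);
        INSERT INTO host VALUES (1, 'web-1', '10.0.0.1');
        CREATE TABLE account (username TEXT);
        INSERT INTO account VALUES ('db_admin');
        """
    )
    return conn


def build_query(host_id):
    """Concatenate the raw query parameter into the SQL text (the vulnerable step)."""
    return "SELECT name, ip FROM host WHERE id=" + host_id


def lookup_vulnerable(conn, host_id):
    """Run the concatenated query and return all rows."""
    return conn.execute(build_query(host_id)).fetchall()


if __name__ == "__main__":
    conn = make_db()
    # A well-behaved request returns only the host row.
    print(lookup_vulnerable(conn, "1"))
    # An injected request also returns rows from the unrelated account table.
    print(lookup_vulnerable(conn, "1 UNION SELECT username, NULL FROM account"))
```

The injected text is parsed as SQL rather than treated as a value, which is exactly the kind of request the detection script in this post is meant to surface.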

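We won't link to any news stories about damaging hacks; that's not why we're here. As developers and security engineers, we believe our job is not to scare but to empower. Pixie[2] is a powerful observability platform, and as security people we saw some unique opportunities to apply Pixie to security use cases. Here we'll show you how to use Pixie to proactively detect and report SQL injection attempts while your application is running.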

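Compared with traditional WAFs (web application firewalls) and other tools that position themselves as a middleman in the tech stack, Pixie runs inside the cluster and hooks into your nodes at the kernel level. Specifically, traditional firewalls are limited to inspecting the surface of network traffic, whereas Pixie uses eBPF[3] (extended Berkeley Packet Filter) tracing to provide visibility into the operating system itself. This lets Pixie collect data across most layers of the OSI (Open Systems Interconnection) model instead of being confined to a single one. In practice, this means we can look at raw HTTP and database requests at the application layer while also seeing past any encryption that happens at the presentation layer. Put plainly, context is king, and Pixie lets us understand data flows at every layer of context.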

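Why detect injection attempts? Why not actively block them? Because blocking works, until it doesn't. No firewall stays 100% effective forever; eventually someone determined will find a way through. And when they do, we won't know about it until the consequences of the attack come to light.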

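Compared with blocking, detection gives more information to defenders and less to attackers. For example, suppose an attacker probes the system with some of the more obvious injection attempts. These are probably the malicious queries most likely to be known to the firewall and actively blocked. That means we defenders never learn about these initial blocked attempts, while the attacker gets a chance to learn about the firewall. There is now an information asymmetry in the attacker's favor. While we defenders keep a blind spot because of our blocker, the attacker can try more insidious queries until one gets past the firewall.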

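Detection lets us observe attacks on our code while the system is running. What we can observe, we can understand. Understanding is how we turn SQL injection from a monster into something more like a weed: it is an inevitable part of growing a codebase, and it can be really bad. But if we can observe it, we can nip it in the bud.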

To that end, we made a simple PxL script[4] that uses Pixie to flag suspicious database queries that look like SQL injection attempts.

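This script is a proof of concept for a grander vision. We don't want to rely on firewalls as the primary defense for our systems, because firewalls are neither responsive to nor context-aware of the actual application. We want a tool that flags what we think are injections, yet is smart enough to keep false positives to a minimum without blocking anything. That way, we humans have full visibility into attempted attacks, and we have the final say on which events constitute serious attempts.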

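At New Relic, we're very excited to be building a security product with Pixie that will realize this vision, a tool that will cover a large part of the OWASP Top 10 vulnerabilities.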

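In the short term, we'll contribute SQL injection detection to the open source Pixie project as part of Pixie's built-in SQL parser. We're also extending the proof of concept to cross-site scripting (XSS) and server-side request forgery (SSRF) attacks.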

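In the medium term, we want to replace the regular-expression rule set with machine learning detection. The Pixie team has already laid the groundwork for a machine learning approach; we'll be able to take advantage of PxL's existing support for Tensorflow models[5]. In the long term, we're designing an observability-based security product that will run on open source building blocks.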

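Because that long-term vision will take a while to realize, we'll leave you with the recipe for our SQL injection proof of concept. You can dig into the source code and test it against a vulnerable app using this demo repository[6].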

So spin up your development environment and get ready to turn monsters into dandelions.

PxL script for identifying potential SQL injections

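The PxL script identifies SQL injections by matching each query against a simple set of regular expressions. Each regular expression is associated with a specific SQL injection rule. For example, if a query contains a comment (--), it is flagged as a SQL injection attack that breaks the comment-dash rule, which appears in the data table as RULE_BROKEN.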

[Figure: SQL injection table]

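Regular expressions are fairly easy to evade in the real world. Attackers, however, usually start with attempts like these to see whether a vulnerability is present, so this rule set catches many of their first attempts. That is handy if you want to know whether someone is probing your system for vulnerabilities before they try something more sophisticated, which is why we plan to build these rules into the Pixie SQL parser.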

# Rule set to capture some obvious attempts.
SCRIPT_TAG_RULE = "(<|%3C)\\s*[sS][cC][rR][iI][pP][tT]"
COMMENT_DASH_RULE = "--"
COMMENT_SLASH_RULE = "\\/\\*"
SEMICOLON_RULE = ";.+"
UNMATCHED_QUOTES_RULE = "^([^']*'([^']*'[^']*')*[^']*')[^']*'[^']*$"
UNION_RULE = "UNION"
CHAR_CASTING_RULE = "[cC][hH][rR](\\(|%28)"
SYSTEM_CATALOG_ACCESS_RULE = "[fF][rR][oO][mM]\\s+[pP][gG]_"
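To see how the rule set behaves on concrete queries without a Pixie cluster, the sketch below applies the same rules with Python's re module. It is only an approximation: the patterns are written as Python raw strings with the escapes we assume the rules intend, broken_rules is a hypothetical helper rather than part of PxL, and Python's re engine is not identical to the RE2 engine Pixie uses.

```python
import re

# The rule set, rewritten as Python raw strings (escaping is an assumption).
RULES = {
    "script_tag": r"(<|%3C)\s*[sS][cC][rR][iI][pP][tT]",
    "comment_dashes": r"--",
    "comment_slash_star": r"/\*",
    "semicolon": r";.+",
    "unmatched_quotes": r"^([^']*'([^']*'[^']*')*[^']*')[^']*'[^']*$",
    "union": r"UNION",
    "char_casting": r"[cC][hH][rR](\(|%28)",
    "system_catalog_access": r"[fF][rR][oO][mM]\s+[pP][gG]_",
}
COMPILED = {name: re.compile(pattern) for name, pattern in RULES.items()}


def broken_rules(query):
    """Return the names of all rules whose pattern occurs anywhere in the query."""
    return [name for name, pattern in COMPILED.items() if pattern.search(query)]


if __name__ == "__main__":
    samples = [
        "SELECT name, ip FROM host WHERE id=1",
        "SELECT name, ip FROM host WHERE id=1 UNION SELECT current_user, NULL FROM user",
        "SELECT 1; DROP TABLE host",
    ]
    for query in samples:
        print(broken_rules(query), "<-", query)
```

Using search here plays the same role as wrapping each rule in ".*" on both sides for a full-string match. Note that a rule such as UNION_RULE will also flag legitimate queries that use UNION, which is one reason the output is meant for human review rather than automatic blocking.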

Here is the full PxL script:

# Copyright 2018- The Pixie Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# SPDX-License-Identifier: Apache-2.0
''' PostgreSQL Data Tracer
Shows the most recent PostgreSQL messages in the cluster.
'''
import px
SCRIPT_TAG_RULE = "(<|%3C)\\s*[sS][cC][rR][iI][pP][tT]"
COMMENT_DASH_RULE = "--"
COMMENT_SLASH_RULE = "\\/\\*"
SEMICOLON_RULE = ";.+"
UNMATCHED_QUOTES_RULE = "^([^']*'([^']*'[^']*')*[^']*')[^']*'[^']*$"
UNION_RULE = "UNION"
CHAR_CASTING_RULE = "[cC][hH][rR](\\(|%28)"
SYSTEM_CATALOG_ACCESS_RULE = "[fF][rR][oO][mM]\\s+[pP][gG]_"
# google re2 doesn't support backreferences
# ALWAYS_TRUE_RULE = "OR\\s+(['\\w]+)=\\1"
def add_sql_injection_rule(df, rule_name, rule):
    df[rule_name] = px.regex_match(".*" + rule + ".*", df.req)
    return df
def sql_injections(df):
    df = add_sql_injection_rule(df, 'script_tag', SCRIPT_TAG_RULE)
    df = add_sql_injection_rule(df, 'comment_dashes', COMMENT_DASH_RULE)
    df = add_sql_injection_rule(df, 'comment_slash_star', COMMENT_SLASH_RULE)
    df = add_sql_injection_rule(df, 'semicolon', SEMICOLON_RULE)
    df = add_sql_injection_rule(df, 'unmatched_quotes', UNMATCHED_QUOTES_RULE)
    df = add_sql_injection_rule(df, 'union', UNION_RULE)
    df = add_sql_injection_rule(df, 'char_casting', CHAR_CASTING_RULE)
    df = add_sql_injection_rule(df, 'system_catalog_access', SYSTEM_CATALOG_ACCESS_RULE)
    df = df[
        df.script_tag or (df.comment_dashes or (df.comment_slash_star or (df.semicolon or (
                df.unmatched_quotes or (df.union or (df.char_casting or df.system_catalog_access))))))]
    df.rule_broken = px.select(df.script_tag, 'script_tag',
                               px.select(df.comment_dashes, 'comment_dashes',
                                         px.select(df.comment_slash_star, 'comment_slash_star',
                                                   px.select(df.unmatched_quotes, 'unmatched_quotes',
                                                             px.select(df.union, 'union',
                                                                       px.select(df.char_casting, 'char_casting',
                                                                                 px.select(df.system_catalog_access,
                                                                                           'system_catalog_access',
                                                                                           px.select(df.semicolon,
                                                                                                     'semicolon',
                                                                                                     'N/A'))))))))
    return df[['time_', 'source', 'destination', 'remote_port', 'req', 'resp', 'latency', 'rule_broken']]
def pgsql_data(start_time: str, source_filter: str, destination_filter: str, num_head: int):
    df = px.DataFrame(table='pgsql_events', start_time=start_time)
    df = add_source_dest_columns(df)
    # Filter out entities as specified by the user.
    df = df[px.contains(df.source, source_filter)]
    df = df[px.contains(df.destination, destination_filter)]
    # Add additional filters below:
    # Restrict number of results.
    df = df.head(num_head)
    df = add_source_dest_links(df, start_time)
    df = df[['time_', 'source', 'destination', 'remote_port', 'req', 'resp', 'latency']]
    return df
def potential_sql_injections(start_time: str, source_filter: str, destination_filter: str, num_head: int):
    df = pgsql_data(start_time, source_filter, destination_filter, num_head)
    df = sql_injections(df)
    return df
def add_source_dest_columns(df):
    ''' Add source and destination columns for the PostgreSQL request.
    PostgreSQL requests are traced server-side (trace_role==2), unless the server is
    outside of the cluster in which case the request is traced client-side (trace_role==1).
    When trace_role==2, the PostgreSQL request source is the remote_addr column
    and destination is the pod column. When trace_role==1, the PostgreSQL request
    source is the pod column and the destination is the remote_addr column.
    Input DataFrame must contain trace_role, upid, remote_addr columns.
    '''
    df.pod = df.ctx['pod']
    df.namespace = df.ctx['namespace']
    # If remote_addr is a pod, get its name. If not, use IP address.
    df.ra_pod = px.pod_id_to_pod_name(px.ip_to_pod_id(df.remote_addr))
    df.is_ra_pod = df.ra_pod != ''
    df.ra_name = px.select(df.is_ra_pod, df.ra_pod, df.remote_addr)
    df.is_server_tracing = df.trace_role == 2
    df.is_source_pod_type = px.select(df.is_server_tracing, df.is_ra_pod, True)
    df.is_dest_pod_type = px.select(df.is_server_tracing, True, df.is_ra_pod)
    # Set source and destination based on trace_role.
    df.source = px.select(df.is_server_tracing, df.ra_name, df.pod)
    df.destination = px.select(df.is_server_tracing, df.pod, df.ra_name)
    # Filter out messages with empty source / destination.
    df = df[df.source != '']
    df = df[df.destination != '']
    df = df.drop(['ra_pod', 'is_ra_pod', 'ra_name', 'is_server_tracing'])
    return df
def add_source_dest_links(df, start_time: str):
    ''' Modifies the source and destination columns to display deeplinks in the UI.
    Clicking on a pod name in either column will run the px/pod script for that pod.
    Clicking on an IP address, will run the px/net_flow_graph script showing all
    network connections to/from that address.
    Input DataFrame must contain source, destination, is_source_pod_type,
    is_dest_pod_type, and namespace columns.
    '''
    # Source linking. If source is a pod, link to px/pod. If an IP addr, link to px/net_flow_graph.
    df.src_pod_link = px.script_reference(df.source, 'px/pod', {
        'start_time': start_time,
        'pod': df.source
    })
    df.src_link = px.script_reference(df.source, 'px/net_flow_graph', {
        'start_time': start_time,
        'namespace': df.namespace,
        'from_entity_filter': df.source,
        'to_entity_filter': '',
        'throughput_filter': '0.0'
    })
    df.source = px.select(df.is_source_pod_type, df.src_pod_link, df.src_link)
    # If destination is a pod, link to px/pod. If an IP addr, link to px/net_flow_graph.
    df.dest_pod_link = px.script_reference(df.destination, 'px/pod', {
        'start_time': start_time,
        'pod': df.destination
    })
    df.dest_link = px.script_reference(df.destination, 'px/net_flow_graph', {
        'start_time': start_time,
        'namespace': df.namespace,
        'from_entity_filter': '',
        'to_entity_filter': df.destination,
        'throughput_filter': '0.0'
    })
    df.destination = px.select(df.is_dest_pod_type, df.dest_pod_link, df.dest_link)
    df = df.drop(['src_pod_link', 'src_link', 'is_source_pod_type', 'dest_pod_link',
                  'dest_link', 'is_dest_pod_type'])
    return df
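The script leaves its ALWAYS_TRUE_RULE commented out because RE2 does not support backreferences. As a rough illustration of what such a tautology rule could catch, the sketch below uses Python's re module, which does support them. The pattern is our assumed reading of the commented-out rule, and is_always_true is a hypothetical helper, not something PxL provides.

```python
import re

# Assumed form of the commented-out rule: "OR <token>=<same token>".
# \1 is a backreference to whatever the first group captured.
ALWAYS_TRUE = re.compile(r"OR\s+(['\w]+)=\1")


def is_always_true(query):
    """Return True if the query contains a condition such as OR 1=1 or OR 'a'='a'."""
    return ALWAYS_TRUE.search(query) is not None


if __name__ == "__main__":
    for query in (
        "SELECT * FROM host WHERE id=1 OR 1=1",
        "SELECT * FROM host WHERE name='x' OR 'a'='a'",
        "SELECT * FROM host WHERE id=1 OR id=2",
    ):
        print(is_always_true(query), "<-", query)
```

The backreference is what lets a single pattern require both sides of the comparison to be the same token; without it, the rule would need to enumerate specific tautologies.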

Below is an accompanying vis.json, which gives you a nicely formatted table:

{
  "variables": [
    {
      "name": "start_time",
      "type": "PX_STRING",
      "description": "The relative start time of the window. Current time is assumed to be now.",
      "defaultValue": "-5m"
    },
    {
      "name": "source_filter",
      "type": "PX_STRING",
      "description": "The partial string to match the 'source' column.",
      "defaultValue": ""
    },
    {
      "name": "destination_filter",
      "type": "PX_STRING",
      "description": "The partial string to match the 'destination' column.",
      "defaultValue": ""
    },
    {
      "name": "max_num_records",
      "type": "PX_INT64",
      "description": "Max number of records to show.",
      "defaultValue": "1000"
    }
  ],
  "globalFuncs": [
    {
      "outputName": "potential_sql_injections",
      "func": {
        "name": "potential_sql_injections",
        "args": [
          {
            "name": "start_time",
            "variable": "start_time"
          },
          {
            "name": "source_filter",
            "variable": "source_filter"
          },
          {
            "name": "destination_filter",
            "variable": "destination_filter"
          },
          {
            "name": "num_head",
            "variable": "max_num_records"
          }
        ]
      }
    }
  ],
  "widgets": [
    {
      "name": "Table",
      "position": {
        "x": 0,
        "y": 0,
        "w": 12,
        "h": 4
      },
      "globalFuncOutputName": "potential_sql_injections",
      "displaySpec": {
        "@type": "types.px.dev/px.vispb.Table"
      }
    }
  ]
}

References

[1] Top 10 CVEs: https://owasp.org/Top10/

[2] Pixie: https://px.dev/

[3] eBPF: https://docs.px.dev/about-pixie/pixie-ebpf/

[4] PxL script: https://docs.px.dev/reference/pxl/

[5] Existing support for Tensorflow models: https://blog.tensorflow.org/2021/06/leveraging-machine-learning-pixie.html

[6] Demo repository: https://github.com/pixie-io/pixie-demos/tree/main/sql-injection-demo
