[Python] Processing XML Files with SAX and DOM

2023-03-07 21:32:35

Contents

  • Preface
  • The SAX Module
    • Reading XML Files with SAX
      • Common Functions
      • The SAX Parser
      • SAX Event Handlers
    • Complete Example: Parsing an XML File with SAX

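Preface

SAX and DOM are both techniques for processing XML files, but they work differently. SAX is an event-driven approach: it reads the XML file sequentially and fires an event for each construct it encounters (the start of an element, the end of an element, a run of text), and the application parses the file by handling those events. DOM instead loads the whole XML file into memory as a tree, and the application parses the file by traversing that tree. Each approach has trade-offs: SAX uses little memory but offers no random access to the document, while DOM allows free navigation at the cost of holding the entire document in memory. Which one to use depends on the task.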
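The contrast can be sketched with the two standard library modules xml.sax and xml.dom.minidom. This is a minimal illustration, not a benchmark; the SAMPLE document and the BookCollector class are hypothetical names introduced only for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
SAMPLE = b"<books><book>SAX</book><book>DOM</book></books>"


class BookCollector(xml.sax.ContentHandler):
    """Collects the text of every <book> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.in_book = False
        self.books = []

    def startElement(self, name, attrs):
        if name == "book":
            self.in_book = True
            self.books.append("")

    def endElement(self, name):
        if name == "book":
            self.in_book = False

    def characters(self, content):
        if self.in_book:
            self.books[-1] += content


# SAX: the handler is called back while the parser streams through the input.
collector = BookCollector()
xml.sax.parseString(SAMPLE, collector)
sax_books = collector.books

# DOM: the whole document becomes a tree that can be queried afterwards.
tree = minidom.parseString(SAMPLE)
dom_books = [node.firstChild.data for node in tree.getElementsByTagName("book")]

print(sax_books, dom_books)
```

Both approaches yield the same list here. The difference is that the SAX handler has to track its own state (in_book), whereas the DOM tree can be queried after parsing has finished.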

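The SAX Module

The SAX module parses XML documents with an event-driven model: it processes the elements and attributes of a document one at a time and fires the corresponding events. Compared with the DOM model, SAX is more lightweight and is well suited to processing large XML documents.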
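As a rough sketch of why this suits large inputs, the reader returned by make_parser in CPython (the expat-based reader) also supports incremental feed() and close() calls, so a document can be parsed piece by piece without ever being held in memory as a whole. The ElementCounter class, the count_elements helper and the generated log document are hypothetical, introduced only for this example.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags without storing any document content."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(chunks):
    """Feed an iterable of text chunks to the parser and count the elements."""
    parser = xml.sax.make_parser()
    counter = ElementCounter()
    parser.setContentHandler(counter)
    for chunk in chunks:
        parser.feed(chunk)  # parse this piece and fire events for it
    parser.close()  # signal the end of the input
    return counter.count


def generate_chunks(n=1000):
    """Hypothetical input: one <log> root with n <entry/> children, produced lazily."""
    yield "<log>"
    for i in range(n):
        yield f"<entry id='{i}'/>"
    yield "</log>"


total = count_elements(generate_chunks())
print(total)
```

Only one chunk and one integer counter are alive at any moment, regardless of how many entries the generator produces.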

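Reading XML Files with SAX

xml.sax is a package in the Python standard library for parsing XML documents. It provides an event-based API: as the parser reads a document it fires events, and the application parses and processes the document by handling them.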

Common Functions

make_parser creates and returns a SAX parser as an XMLReader object.

def make_parser(parser_list=()):
    """Creates and returns a SAX parser.

    Creates the first parser it is able to instantiate of the ones
    given in the iterable created by chaining parser_list and
    default_parser_list.  The iterables must contain the names of Python
    modules containing both a SAX parser and a create_parser function."""

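In other words, make_parser chains parser_list with the default parser list and returns the first parser it is able to instantiate. Each entry must be the name of a Python module that contains both a SAX parser and a create_parser function.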
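A minimal sketch of using make_parser directly, assuming the default parser list is acceptable so that no arguments are passed. The TagLister class and the inline document are hypothetical, introduced only for this example.

```python
import io
import xml.sax
import xml.sax.handler


class TagLister(xml.sax.ContentHandler):
    """Records the name of every element in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


reader = xml.sax.make_parser()  # no arguments: fall back to the default parser list
reader.setFeature(xml.sax.handler.feature_namespaces, False)  # plain startElement events
lister = TagLister()
reader.setContentHandler(lister)
reader.parse(io.BytesIO(b"<a><b/><c/></a>"))  # parse() also accepts a file name
tags = lister.tags
print(tags)
```

Creating the reader explicitly is useful when features need to be set before parsing, which the convenience functions parse and parseString do not allow.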


parse creates a SAX parser and uses it to parse an XML document.

def parse(source, handler, errorHandler=ErrorHandler()):
    parser = make_parser()
    parser.setContentHandler(handler)
    parser.setErrorHandler(errorHandler)
    parser.parse(source)
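A short usage sketch for parse, assuming the source is given as a file name. The TextGrabber class and the temporary note.xml file are hypothetical, created only so that the example is self-contained.

```python
import os
import tempfile
import xml.sax


class TextGrabber(xml.sax.ContentHandler):
    """Collects all character data in the document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.xml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("<note>hello</note>")
    grabber = TextGrabber()
    xml.sax.parse(path, grabber)  # source may be a file name or a file object

text = "".join(grabber.parts)
print(text)
```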

parseString is similar to parse, but parses the XML from the string passed as the string argument.

def parseString(string, handler, errorHandler=ErrorHandler()):
    ...  # body omitted in this excerpt
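A short usage sketch for parseString, assuming the document is already in memory as bytes. The AttrReader class and the items document are hypothetical, introduced only for this example.

```python
import xml.sax


class AttrReader(xml.sax.ContentHandler):
    """Collects the id attribute of every <item> element."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "item":
            self.ids.append(attrs.get("id"))  # None when the attribute is absent


reader = AttrReader()
xml.sax.parseString(b"<items><item id='1'/><item/><item id='3'/></items>", reader)
item_ids = reader.ids
print(item_ids)
```

Attribute values always arrive as strings, so any numeric conversion is left to the handler.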

SAXException encapsulates an error or warning raised during XML processing.

class SAXException(Exception):
    """Encapsulate an XML error or warning. This class can contain
    basic error or warning information from either the XML parser or
    the application: you can subclass it to provide additional
    functionality, or to add localization. Note that although you will
    receive a SAXException as the argument to the handlers in the
    ErrorHandler interface, you are not actually required to raise
    the exception; instead, you can simply read the information in
    it."""
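In practice, malformed input usually surfaces as SAXParseException, a subclass of SAXException that also carries position information. The sketch below assumes the default error handler, which raises on fatal errors; try_parse is a hypothetical helper introduced only for this example.

```python
import xml.sax


def try_parse(data):
    """Return None on success, or a short description of the parse error."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        # SAXParseException is a SAXException that also knows where it happened.
        return f"line {exc.getLineNumber()}: {exc.getMessage()}"
    return None


ok = try_parse(b"<a><b/></a>")
bad = try_parse(b"<a><b></a>")  # <b> is never closed
print(ok, bad)
```

Catching the base class SAXException instead would also cover errors that are not tied to a position in the document.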

The SAX Parser

Its main job is to send events to the event handler.
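To make the flow of events visible, the following sketch records every callback the parser makes for a tiny document. The EventTracer class is hypothetical, introduced only for this example.

```python
import xml.sax


class EventTracer(xml.sax.ContentHandler):
    """Appends a short description of each event to a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(f"start:{name}")

    def endElement(self, name):
        self.events.append(f"end:{name}")

    def characters(self, content):
        self.events.append(f"chars:{content}")


tracer = EventTracer()
xml.sax.parseString(b"<a><b>hi</b></a>", tracer)
events = tracer.events
for event in events:
    print(event)
```

The start and end events nest in the same order as the tags in the document, with the character data reported between the start and end of the element that contains it.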

SAX Event Handlers

Event handling is implemented through the ContentHandler class.

The order of events in this interface mirrors the order of the information in the document.

class ContentHandler:
    """Interface for receiving logical document content events.

    This is the main callback interface in SAX, and the one most
    important to applications. The order of events in this interface
    mirrors the order of the information in the document."""

    def __init__(self):
        self._locator = None  # the document locator, set by the parser

    def setDocumentLocator(self, locator):
        """Called by the parser to give the application a locator for
        locating the origin of document events.

        SAX parsers are strongly encouraged (though not absolutely
        required) to supply a locator: if it does so, it must supply
        the locator to the application by invoking this method before
        invoking any of the other methods in the DocumentHandler
        interface.

        The locator allows the application to determine the end
        position of any document-related event, even if the parser is
        not reporting an error. Typically, the application will use
        this information for reporting its own errors (such as
        character content that does not match an application's
        business rules). The information returned by the locator is
        probably not sufficient for use with a search engine.

        Note that the locator will return correct information only
        during the invocation of the events in this interface. The
        application should not attempt to use it at any other time."""
        self._locator = locator

    def startDocument(self):
        """Receive notification of the beginning of a document.

        The SAX parser will invoke this method only once, before any
        other methods in this interface or in DTDHandler (except for
        setDocumentLocator)."""

    def endDocument(self):
        """Receive notification of the end of a document.

        The SAX parser will invoke this method only once, and it will
        be the last method invoked during the parse. The parser shall
        not invoke this method until it has either abandoned parsing
        (because of an unrecoverable error) or reached the end of
        input."""

    def startPrefixMapping(self, prefix, uri):
        """Begin the scope of a prefix-URI Namespace mapping.

        The information from this event is not necessary for normal
        Namespace processing: the SAX XML reader will automatically
        replace prefixes for element and attribute names when the
        http://xml.org/sax/features/namespaces feature is true (the
        default).

        There are cases, however, when applications need to use
        prefixes in character data or in attribute values, where they
        cannot safely be expanded automatically; the
        start/endPrefixMapping event supplies the information to the
        application to expand prefixes in those contexts itself, if
        necessary.

        Note that start/endPrefixMapping events are not guaranteed to
        be properly nested relative to each-other: all
        startPrefixMapping events will occur before the corresponding
        startElement event, and all endPrefixMapping events will occur
        after the corresponding endElement event, but their order is
        not guaranteed."""

    def endPrefixMapping(self, prefix):
        """End the scope of a prefix-URI mapping.

        See startPrefixMapping for details. This event will always
        occur after the corresponding endElement event, but the order
        of endPrefixMapping events is not otherwise guaranteed."""

    def startElement(self, name, attrs):
        """Signals the start of an element in non-namespace mode.

        The name parameter contains the raw XML 1.0 name of the
        element type as a string and the attrs parameter holds an
        instance of the Attributes class containing the attributes of
        the element."""

    def endElement(self, name):
        """Signals the end of an element in non-namespace mode.

        The name parameter contains the name of the element type, just
        as with the startElement event."""

    def startElementNS(self, name, qname, attrs):
        """Signals the start of an element in namespace mode.

        The name parameter contains the name of the element type as a
        (uri, localname) tuple, the qname parameter the raw XML 1.0
        name used in the source document, and the attrs parameter
        holds an instance of the Attributes class containing the
        attributes of the element.

        The uri part of the name tuple is None for elements which have
        no namespace."""

    def endElementNS(self, name, qname):
        """Signals the end of an element in namespace mode.

        The name parameter contains the name of the element type, just
        as with the startElementNS event."""

    def characters(self, content):
        """Receive notification of character data.

        The Parser will call this method to report each chunk of
        character data. SAX parsers may return all contiguous
        character data in a single chunk, or they may split it into
        several chunks; however, all of the characters in any single
        event must come from the same external entity so that the
        Locator provides useful information."""

    def ignorableWhitespace(self, whitespace):
        """Receive notification of ignorable whitespace in element content.

        Validating Parsers must use this method to report each chunk
        of ignorable whitespace (see the W3C XML 1.0 recommendation,
        section 2.10): non-validating parsers may also use this method
        if they are capable of parsing and using content models.

        SAX parsers may return all contiguous whitespace in a single
        chunk, or they may split it into several chunks; however, all
        of the characters in any single event must come from the same
        external entity, so that the Locator provides useful
        information."""

    def processingInstruction(self, target, data):
        """Receive notification of a processing instruction.

        The Parser will invoke this method once for each processing
        instruction found: note that processing instructions may occur
        before or after the main document element.

        A SAX parser should never report an XML declaration (XML 1.0,
        section 2.8) or a text declaration (XML 1.0, section 4.3.1)
        using this method."""

    def skippedEntity(self, name):
        """Receive notification of a skipped entity.

        The Parser will invoke this method once for each entity
        skipped. Non-validating processors may skip entities if they
        have not seen the declarations (because, for example, the
        entity was declared in an external DTD subset). All processors
        may skip external entities, depending on the values of the
        http://xml.org/sax/features/external-general-entities and the
        http://xml.org/sax/features/external-parameter-entities
        properties."""


# ===== DTDHandler =====
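Two points from the docstrings above can be sketched together: the locator passed to setDocumentLocator can report where an event occurred, and characters may deliver text in several chunks, so a handler should buffer it until the end tag. The PositionedText class and the DOC sample are hypothetical, introduced only for this example.

```python
import xml.sax


class PositionedText(xml.sax.ContentHandler):
    """Records (line, tag, text) for each element that contains text."""

    def __init__(self):
        super().__init__()
        self._locator = None
        self._buffer = []
        self.records = []

    def setDocumentLocator(self, locator):
        self._locator = locator  # valid only while events are being delivered

    def startElement(self, name, attrs):
        self._buffer = []  # start collecting text for this element

    def characters(self, content):
        self._buffer.append(content)  # may be called more than once per text run

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if text:
            line = self._locator.getLineNumber() if self._locator else None
            self.records.append((line, name, text))
        self._buffer = []


DOC = b"<root>\n  <x>one</x>\n  <y>two</y>\n</root>"
handler = PositionedText()
xml.sax.parseString(DOC, handler)
records = handler.records
print(records)
```

This simple buffer is enough for elements that hold only text. Mixed content, where text and child elements alternate, would need a stack of buffers instead.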

Complete Example: Parsing an XML File with SAX

SAX_parse_XML.py

import xml.sax
import xml.sax.handler

FIELDS = ("title", "name", "amount", "price")  # elements whose text is collected
get_record = []  # receives the data read from the XML document


class GetStorehouse(xml.sax.ContentHandler):  # event handler
    def __init__(self):
        super().__init__()
        self.CurrentData = ""  # tag name of the element currently being read
        self.category = ""  # category attribute of the current <goods>
        self.title = ""  # second-level category of the goods
        self.name = ""
        self.amount = ""
        self.price = ""

    def startElement(self, label, attributes):  # called at each start tag
        self.CurrentData = label  # label is the tag name passed in by the parser
        if label == "goods":
            self.category = attributes["category"]
        elif label in FIELDS:
            setattr(self, label, "")  # start with an empty buffer for this field

    def endElement(self, label):  # called at each end tag
        if label in FIELDS:
            get_record.append(getattr(self, label))
        # Forget the current tag so that whitespace between elements is ignored.
        self.CurrentData = ""

    def characters(self, content):  # called for each chunk of character data
        # Text may arrive in several chunks, so append instead of assigning.
        if self.CurrentData in FIELDS:
            setattr(self, self.CurrentData, getattr(self, self.CurrentData) + content)


# =======
parser = xml.sax.make_parser()  # create the parser's XMLReader object
parser.setFeature(xml.sax.handler.feature_namespaces, 0)  # turn off namespace processing
Handler = GetStorehouse()
parser.setContentHandler(Handler)
parser.parse("storehouse.xml")
print(get_record)

Output:

['淡水鱼', '鲫鱼', '18', '8', '温带水果', '猕猴桃', '10', '10']

storehouse.xml
<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>
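The category attribute of each goods element can also be kept with its fields. The following variant is a sketch under that assumption: it builds one dictionary per goods element and parses the same document from an in-memory string. GoodsCollector, FIELDS and STOREHOUSE are hypothetical names introduced only for this example.

```python
import xml.sax

FIELDS = ("title", "name", "amount", "price")

STOREHOUSE = """<storehouse>
    <goods category="fish">
        <title>淡水鱼</title>
        <name>鲫鱼</name>
        <amount>18</amount>
        <price>8</price>
    </goods>
    <goods category="fruit">
        <title>温带水果</title>
        <name>猕猴桃</name>
        <amount>10</amount>
        <price>10</price>
    </goods>
</storehouse>""".encode("utf-8")


class GoodsCollector(xml.sax.ContentHandler):
    """Builds one dict per <goods> element, including its category attribute."""

    def __init__(self):
        super().__init__()
        self.goods = []
        self._current = None  # dict being filled for the open <goods>
        self._field = None  # name of the field element being read, if any

    def startElement(self, name, attrs):
        if name == "goods":
            self._current = {"category": attrs.get("category")}
        elif name in FIELDS and self._current is not None:
            self._field = name
            self._current[name] = ""

    def characters(self, content):
        if self._field is not None:
            self._current[self._field] += content  # text may arrive in chunks

    def endElement(self, name):
        if name == self._field:
            self._field = None
        elif name == "goods":
            self.goods.append(self._current)
            self._current = None


collector = GoodsCollector()
xml.sax.parseString(STOREHOUSE, collector)
goods = collector.goods
for item in goods:
    print(item)
```

Grouping the fields per element avoids having to split a flat list into records afterwards, and whitespace between elements is ignored because no field is active while it is reported.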
