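Fluent Python, Chapter 4: Study Notes

A string is a sequence of characters.

Byte sequences: machine core dumps

Unicode: human-readable text

Turning a byte sequence into a human-readable text string is decoding (decode).

Turning a string into a byte sequence for storage or transmission is encoding (encode).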
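A minimal sketch of the round trip between the two types. The sample string is my own choice, not from the book's text; the codec is passed explicitly so the result does not depend on any default.

```python
# Round trip between text (str) and bytes with an explicit codec.
text = 'caf\u00e9'                 # 'café', 4 characters

data = text.encode('utf-8')        # str -> bytes
print(data)                        # b'caf\xc3\xa9'
print(len(text), len(data))        # 4 5  (é takes two bytes in UTF-8)

restored = data.decode('utf-8')    # bytes -> str
print(restored == text)            # True
```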

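The Python 3 str type is roughly equivalent to the Python 2 unicode type.

Python 3 uses UTF-8 by default, both for source files and for str.encode() / bytes.decode().

Python 2 defaults to ASCII.

Encoding and decoding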

def encode(self, *args, **kwargs): # real signature unknown
    """
    Encode the string using the codec registered for encoding.

      encoding (argument 1: the encoding)
        The encoding in which to encode the string.
      errors (argument 2: the error handling scheme)
        The error handling scheme to use for encoding errors.
        The default is 'strict' meaning that encoding errors raise a
        UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
        'xmlcharrefreplace' as well as any other name registered with
        codecs.register_error that can handle UnicodeEncodeErrors.
    """
    pass

def decode(self, *args, **kwargs): # real signature unknown
    """
    Decode the bytearray using the codec registered for encoding.

      encoding
        The encoding with which to decode the bytearray.
      errors
        The error handling scheme to use for the handling of decoding errors.
        The default is 'strict' meaning that decoding errors raise a
        UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
        as well as any other name registered with codecs.register_error that
        can handle UnicodeDecodeErrors.
    """
    pass

Error handling schemes

A codec can implement different error handling schemes by accepting the errors string argument.

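'strict': raise UnicodeError (or a subclass); this is the default scheme. Implemented in strict_errors().

The following error handling schemes apply only to text encodings: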

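'replace': substitute a suitable replacement marker; Python's built-in codecs use the official U+FFFD replacement character when decoding and '?' when encoding. Implemented in replace_errors().

In addition, the following error handling scheme is specific to the listed codecs:

Value	Codecs	Meaning
'surrogatepass'	utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le	Allows encoding and decoding of surrogate code points. These codecs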
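To compare the handlers side by side, here is a small sketch. The sample strings are my own; the handler names are the ones listed in the encode/decode docstrings and in the table above.

```python
city = 'S\u00e3o Paulo'   # 'São Paulo'; ã has no ASCII representation

# 'strict' (the default) raises on the first character it cannot encode.
try:
    city.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict failed:', exc.reason)

ignored = city.encode('ascii', errors='ignore')
replaced = city.encode('ascii', errors='replace')
xml = city.encode('ascii', errors='xmlcharrefreplace')
print(ignored)    # b'So Paulo'
print(replaced)   # b'S?o Paulo'
print(xml)        # b'S&#227;o Paulo'

# When decoding, 'replace' substitutes U+FFFD for bytes it cannot decode.
decoded = b'caf\xe9'.decode('utf-8', errors='replace')
print(decoded == 'caf\ufffd')   # True

# 'surrogatepass' lets a lone surrogate through the UTF-8 codec.
lone = '\ud800'
raw = lone.encode('utf-8', errors='surrogatepass')
print(raw)                                                    # b'\xed\xa0\x80'
print(raw.decode('utf-8', errors='surrogatepass') == lone)   # True
```

Note that 'ignore' silently drops data, so it is rarely what you want outside of quick experiments.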

Defining your own encoding error handler

codecs.register_error(name, error_handler)

name: the name under which the handler is registered
error_handler: the error handling function

Custom error handling
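A sketch of what a custom handler might look like. The function name hex_escape and the handler name 'hexescape' are made up for illustration. The handler receives the exception and returns a tuple of (replacement, position at which to resume).

```python
import codecs


def hex_escape(exc):
    """Hypothetical handler: write unencodable characters as <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join(f'<U+{ord(ch):04X}>' for ch in bad)
    # Return the replacement text and the index where encoding continues.
    return replacement, exc.end


# 'hexescape' is an arbitrary name chosen for this example.
codecs.register_error('hexescape', hex_escape)

result = 'caf\u00e9'.encode('ascii', errors='hexescape')
print(result)   # b'caf<U+00E9>'
```

Once registered, the name can be passed as errors= to any encode call in the same process.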

Detecting the encoding of a byte sequence

import chardet

print(chardet.detect(b'aaaa'))
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
print(chardet.detect(b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'))
# {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}

The chardet module can guess the encoding of a piece of content.

import locale

print(locale.getpreferredencoding())  # e.g. UTF-8; depends on the platform and locale

BOM

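On Windows, opening a UTF-8 encoded txt file with open() can leave an extra character, \ufeff, at the start of the text. This is the BOM (byte order mark). It is there to declare information such as the encoding, but Python parses it as part of the text.

For UTF-16, Python decodes the BOM to an empty string.

For UTF-8, the BOM is decoded as the single character \ufeff.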
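The difference is easy to reproduce in memory. The sketch below builds the bytes by hand with the BOM constants from the standard codecs module; the 'utf-8-sig' codec is the standard way to have the UTF-8 BOM stripped on decode.

```python
import codecs

text = 'hello'

# UTF-8 bytes with a BOM in front, as some Windows editors write them.
with_bom = codecs.BOM_UTF8 + text.encode('utf-8')

plain = with_bom.decode('utf-8')        # BOM survives as a character
cleaned = with_bom.decode('utf-8-sig')  # BOM is stripped
print(repr(plain))     # '\ufeffhello'
print(repr(cleaned))   # 'hello'

# The 'utf-16' codec writes a BOM when encoding and consumes it when decoding.
utf16 = text.encode('utf-16')
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)                          # True
print(repr(utf16.decode('utf-16')))     # 'hello'
```

The same codec name works for files: open(path, encoding='utf-8-sig') reads a file correctly whether or not it starts with a BOM.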

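The Unicode sandwich: the current best practice for handling text

bytes -> str: decode the incoming byte sequences
str: process text only
str -> bytes: encode the outgoing text

⚠️ Code that has to run on multiple machines or in multiple scenarios must never rely on the default encoding.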
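A small sketch of the sandwich with a file, passing encoding= explicitly at both edges so nothing depends on the platform default. The temporary path and file name are just for the example.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Bottom slice: str -> bytes, encoding stated explicitly on output.
with open(path, 'w', encoding='utf-8') as f:
    f.write('caf\u00e9')

# What actually landed on disk: bytes.
with open(path, 'rb') as f:
    raw = f.read()
print(raw)            # b'caf\xc3\xa9'

# Top slice: bytes -> str, encoding stated explicitly on input.
with open(path, encoding='utf-8') as f:
    text = f.read()

# Filling: work on str only.
shouted = text.upper()
print(shouted == 'CAF\u00c9')   # True
```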

Normalizing text for matching

unicodedata.normalize(form, unistr)

normalize

import unicodedata


def nfc_equal(s1, s2):
    print(unicodedata.normalize('NFC', s1))
    print(unicodedata.normalize('NFC', s2))
    return unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)


def fold_equal(s1, s2):
    print(unicodedata.normalize('NFC', s1).casefold())
    print(unicodedata.normalize('NFC', s2).casefold())
    return unicodedata.normalize('NFC', s1).casefold() == unicodedata.normalize('NFC', s2).casefold()


s1 = 'café'
s2 = 'cafe\u0301'
print(s1, s2)  # café café
print(nfc_equal(s1, s2))
# café
# café
# True
print(nfc_equal('A', 'a'))
# A
# a
# False
print(fold_equal('A', 'a'))
# a
# a
# True
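To see why normalization is needed in the first place, it helps to look at the two spellings before they are normalized. A short sketch; the variable names and the 'Straße' example are my own.

```python
import unicodedata

composed = 'caf\u00e9'      # é as one code point (U+00E9)
decomposed = 'cafe\u0301'   # e followed by a combining acute accent (U+0301)

# They render the same but are different code point sequences.
print(len(composed), len(decomposed))   # 4 5
print(composed == decomposed)           # False

# NFD decomposes, NFC composes; either form makes the two comparable.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)
print(nfd == decomposed)   # True
print(nfc == composed)     # True

# casefold() goes further than lower(): ß becomes 'ss'.
folded = 'Stra\u00dfe'.casefold()
print(folded)              # strasse
```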
