Python90-3: The Differences Between bytes and str

Item 3: Know the Differences Between bytes and str

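Preface: we usually represent character sequences with strings (str), but there is a second sequence type that holds raw data: the byte sequence (bytes). You are most likely to meet bytes in network transmission and file I/O.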

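Python has two types that represent sequences of character data: bytes and str.

Instances of bytes contain raw, unsigned 8-bit values (each called a byte, and often displayed in the ASCII encoding):

Note: ASCII maps characters to numeric code points. It defines 128 of them (7 bits), each normally stored in one 8-bit byte. For example, 'a' corresponds to 97, which is 01100001 in binary.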

a = b'h\x65llo'
print(list(a))
print(a)

# output:
# [104, 101, 108, 108, 111]  # the integer value of each byte in a (its ASCII code)
# b'hello'
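
To make the "8-bit values" point concrete, here is a small sketch showing that indexing a bytes object yields an integer while slicing yields bytes. The variable names are only illustrative.

```python
data = b'h\x65llo'

# Indexing a bytes object yields an int in range(256).
first = data[0]
# Slicing yields another bytes object.
head = data[:1]

print(first)           # 104
print(head)            # b'h'
print(chr(first))      # h
print(bin(ord('a')))   # 0b1100001, i.e. 97
```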

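Instances of str contain Unicode code points that represent textual characters.

Note: Unicode code points range from U+0000 to U+10FFFF, so a character is not limited to 16 bits. How many bytes a character occupies depends on the encoding used (UTF-8, UTF-16, and so on).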

a = 'a\u0300 propos'
print(list(a))
print(a)

# output:
# ['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']  # Unicode code points
# à propos
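
As a rough illustration of the difference between code points and encoded bytes, the sketch below counts both for the same string. The combining accent U+0300 is one code point but takes two bytes in UTF-8.

```python
text = 'a\u0300 propos'

# Each element of a str is one Unicode code point.
code_points = [hex(ord(ch)) for ch in text]
# Encoding turns the code points into a sequence of bytes.
encoded = text.encode('utf-8')

print(len(text))         # 9 code points
print(code_points[:2])   # ['0x61', '0x300']
print(len(encoded))      # 10 bytes
```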

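Importantly, a str instance has no associated binary encoding, and a bytes instance has no associated text encoding. To convert Unicode data to binary data, you must call the encode method of str. To convert binary data to Unicode data, you must call the decode method of bytes. You can specify the encoding explicitly or accept the default, which is UTF-8 for both methods in Python 3.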
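
A minimal sketch of the two conversions, using the same text with two different encodings. The sample string is an assumption chosen for illustration.

```python
text = '\u00e0 propos'  # 'à propos' with a precomposed U+00E0

# str -> bytes: the result depends on the encoding you choose.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

print(utf8_bytes)     # b'\xc3\xa0 propos'
print(latin1_bytes)   # b'\xe0 propos'

# bytes -> str: decode with the same encoding that produced the bytes.
print(utf8_bytes.decode('utf-8') == text)       # True
print(latin1_bytes.decode('latin-1') == text)   # True

# With no argument, encode() uses UTF-8.
print(text.encode() == utf8_bytes)              # True
```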

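When writing Python programs, it is important to encode and decode Unicode data at the outermost boundary of your interfaces. This approach is often called the Unicode sandwich.

The core of the program should use the str type, which holds Unicode data, and should not be tied to any particular character encoding. This lets the program accept many input encodings (such as Latin-1, Shift JIS, and Big5) and convert them all to Unicode, while being strict about the output encoding, ideally UTF-8.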

(Figure: the Unicode sandwich. Unicode inside the program, bytes outside. Image sourced from the web.)
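
The sandwich can be sketched roughly as follows. The function name shout and its input_encoding parameter are hypothetical, introduced only for this illustration.

```python
def shout(raw: bytes, input_encoding: str = 'utf-8') -> bytes:
    """Hypothetical helper: uppercase some encoded text."""
    # Bottom slice: decode bytes to str as soon as they enter the program.
    text = raw.decode(input_encoding)
    # Filling: all real work happens on str.
    result = text.upper()
    # Top slice: encode to bytes only when leaving the program, always as UTF-8.
    return result.encode('utf-8')


latin1_input = 'caf\u00e9'.encode('latin-1')   # b'caf\xe9'
output = shout(latin1_input, 'latin-1')
print(output)   # b'CAF\xc3\x89'
```

The core logic never needs to know which encoding the input arrived in.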

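The split between the two types leads to two common situations:

• You want to operate on raw 8-bit sequences that contain UTF-8-encoded strings.
• You want to operate on Unicode strings that have no specific encoding.

You will often need two helper functions to convert between bytes and str. The first takes a bytes or str and always returns a str.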

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

print(repr(to_str(b'foo')))  # 'foo'
print(repr(to_str('bar')))   # 'bar'

The second also takes a bytes or str, but always returns a bytes.

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes

print(repr(to_bytes(b'foo')))  # b'foo'
print(repr(to_bytes('bar')))   # b'bar'

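There are two big gotchas when dealing with raw 8-bit data and Unicode strings. The first is that bytes and str look alike but are not compatible with each other, so you must always be clear about which type a character sequence is. With the + operator, you can add bytes to bytes and str to str: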

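print(b'one' + b'two')  # b'onetwo'
print('one' + 'two')    # onetwo

But you cannot add a str to a bytes: b'one' + 'two' raises TypeError.

The other binary operators are similar: ordering comparisons (>, <) between bytes and str raise TypeError, and == always evaluates to False, even when both contain the same characters. The % operator in format strings likewise does not mix the two types.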
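
The incompatibilities can be checked with a short sketch like the one below. The variable names are illustrative only.

```python
errors = []

# Adding, ordering, and bytes-formatting across the two types all raise TypeError.
for label, operation in [
    ('add', lambda: b'one' + 'two'),
    ('compare', lambda: b'red' > 'blue'),
    ('bytes format', lambda: b'red %s' % 'blue'),
]:
    try:
        operation()
    except TypeError:
        errors.append(label)

print(errors)   # ['add', 'compare', 'bytes format']

# Equality does not raise; it is simply always False.
is_equal = (b'foo' == 'foo')
print(is_equal)   # False

# A str format string accepts bytes, but inserts its repr, which is rarely what you want.
formatted = 'red %s' % b'blue'
print(formatted)  # red b'blue'
```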

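The second gotcha is that file operations default to Unicode strings rather than raw bytes, which can cause surprising failures. For example, say I want to write binary data to a file: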

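with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

This raises TypeError: write() argument must be str, not bytes. The problem is that the file was opened in text mode ('w') instead of binary write mode ('wb'). In text mode, write expects a str containing Unicode data, not a bytes containing binary data. Changing 'w' to 'wb' fixes the problem.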

with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

Reading files works the same way. 'r' mode reads text by default; use 'rb' to read binary data.

with open('data.bin', 'r') as f:
    data = f.read()
# UnicodeDecodeError: 'gbk' codec can't decode byte 0xf5 in position 4: incomplete multibyte sequence
# (seen on a system whose default encoding is GBK; a UTF-8 system reports the 'utf-8' codec instead)

with open('data.bin', 'rb') as f:   # read binary data with 'rb'
    data = f.read()
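
Putting the write and read halves together, here is a self-contained sketch that round-trips the same payload through a temporary directory. The use of tempfile is an assumption made so the example leaves no files behind.

```python
import os
import tempfile

payload = b'\xf1\xf2\xf3\xf4\xf5'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.bin')

    # Text mode rejects bytes outright.
    try:
        with open(path, 'w') as f:
            f.write(payload)
    except TypeError as exc:
        text_mode_error = str(exc)

    # Binary mode writes and reads the bytes unchanged.
    with open(path, 'wb') as f:
        f.write(payload)
    with open(path, 'rb') as f:
        data = f.read()

print(text_mode_error)
print(data == payload)   # True
```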

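Alternatively (not recommended), you can pass the encoding parameter to open to state how the bytes should be decoded. Here the encoding is assumed to be cp1252, a legacy Windows encoding.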

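with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()
# data is 'ñòóôõ'

When the encoding does not match, you either read a string of strange symbols or get a "can't decode" error. You can check the system's default encoding with the following code: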
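
The effect of the chosen encoding can be seen without touching the file system. This sketch decodes the same five bytes in two ways.

```python
payload = b'\xf1\xf2\xf3\xf4\xf5'

# cp1252 maps every one of these bytes to a character, so decoding succeeds.
as_cp1252 = payload.decode('cp1252')
print(as_cp1252)   # ñòóôõ

# The same bytes are not valid UTF-8, so decoding fails.
try:
    payload.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)   # True
```

A successful decode only means the bytes are valid in that encoding, not that the resulting text is what the writer intended.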

import locale
print(locale.getpreferredencoding())
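
Rather than relying on that default, you can pass encoding explicitly when reading and writing text. A minimal sketch, again assuming a temporary directory and a hypothetical file name:

```python
import os
import tempfile

text = 'a\u0300 propos'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'note.txt')   # hypothetical file name

    # Writing and reading with the same explicit encoding is portable across systems.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        round_tripped = f.read()

    # Binary mode shows the bytes that actually landed on disk.
    with open(path, 'rb') as f:
        raw = f.read()

print(round_tripped == text)   # True
print(raw)                     # b'a\xcc\x80 propos'
```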

Things to Remember

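• Python has two types that represent sequences of character data: bytes and str.
• bytes holds sequences of 8-bit values; str holds sequences of Unicode code points.
• Use helper functions to make sure the inputs you operate on are the type you expect (8-bit values, UTF-8-encoded strings, Unicode code points, etc.).
• bytes and str cannot be used together with operators (such as >, ==, +, and %).
• To read or write binary data from or to a file, open it in binary mode ('rb' or 'wb').
• To read or write Unicode data from or to a file, be aware of the system's default text encoding, and pass the encoding parameter to open explicitly to avoid surprises.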
