Damerau-Levenshtein算法实现中的错误及更正

2024-10-09 10:29:44 浏览数 (2)

在实现 Damerau-Levenshtein 算法 时,常见的错误包括边界条件处理不当、转置操作的遗漏或误用、矩阵初始化错误等。Damerau-Levenshtein 算法是 Levenshtein 编辑距离的扩展,它不仅允许插入、删除和替换,还允许 相邻字符的转置。该算法计算两个字符串之间的编辑距离,考虑到这四种操作的最小代价。

以下是一个典型的 Damerau-Levenshtein 算法的 Python 实现,以及可能出现的错误和更正方法。

问题背景:

  1. 一个Python用户在Stack Overflow上发帖抱怨他实现的Damerau-Levenshtein 算法的 Cython版本速度很快,但结果不正确。
  2. 他在debug过程中发现问题似乎出在算法中用于记录编辑距离的行其中一行被错误地填满了1,而参考方法中,这一行中的值是正确的。
  3. 此外,他还遇到了另一个问题:用 malloc 分配空间给三个数组 twoago、oneago 和 thisrow 后,在循环中对它们进行轮换时,free( twoago ) 等操作会出现 double free 或内存损坏的错误。

解决方案:

  1. 对于第一个问题,问题出在循环中对数组 thisrow 的更新方式。在原始代码中,thisrow 的每一行都是通过取前一行的数据然后加1来初始化的。这种初始化方式导致 thisrow 中的所有行都包含相同的数据,因此算法无法正确计算编辑距离。
  2. 正确的初始化方式应该是只初始化 thisrow 的最后一列,其他列的值则通过计算获得。
  3. 对于第二个问题,之所以会出现 double free 或内存损坏的错误,是因为在循环中对数组的轮换方式有问题。原始代码中,twoago、oneago 和 thisrow 三个数组通过以下方式进行轮换:
代码语言:javascript复制
twoago, oneago = oneago, thisrow
  1. 这会导致在释放数组时出现问题,因为数组实际上指向同一个内存区域,释放两次就会导致 double free 错误。
  2. 正确的轮换方式应该是:
代码语言:javascript复制
twoago, oneago, thisrow = oneago, thisrow, twoago

代码示例:

以下是更正后的 Cython 代码:

代码语言:javascript复制
cdef unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
  if a < b: return a
  return b
​
#-----------------------------------------------------------------------------------------------------------
cdef unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b
​
#-----------------------------------------------------------------------------------------------------------
cdef inline int _warp( unsigned int limit, int value ):
  return value if value >= 0 else limit   value
​
############################################################################################################
# ARRAYS THAT SAY SIZE ;-)
#-----------------------------------------------------------------------------------------------------------
cdef class Array_of_unsigned_int:
  cdef unsigned int *data
  cdef unsigned int length
​
  #---------------------------------------------------------------------------------------------------------
  def __cinit__( self, unsigned int length, fill_value = None ):
    self.length = length
    self.data   = <unsigned int *>malloc( length * sizeof( unsigned int ) )  ###OBS### must check malloc doesn't return NULL pointer
    if fill_value is not None:
      self.fill( fill_value )
​
  #---------------------------------------------------------------------------------------------------------
  cdef fill( self, unsigned int value ):
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      d[ idx ] = value
​
  #---------------------------------------------------------------------------------------------------------
  cdef resize( self, unsigned int length ):
    self.data   = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) )  ###OBS### must check realloc doesn't return NULL pointer
    self.length = length
​
  #---------------------------------------------------------------------------------------------------------
  def free( self ):
    """Always remember the milk: Free up memory."""
    free( self.data )  ###OBS### should free memory here
​
  #---------------------------------------------------------------------------------------------------------
  def as_list( self ):
    """Return the array as a Python list."""
    R                       = []
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      R.append( d[ idx ] )
    return R
​
​
############################################################################################################
# CONVERTING UNICODE TO CHARACTER IDs (CIDs)
#---------------------------------------------------------------------------------------------------------
cdef unsigned int _UMX_surrogate_lower_bound    = 0x10000
cdef unsigned int _UMX_surrogate_upper_bound    = 0x10ffff
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
cdef unsigned int _UMX_surrogate_foobar_factor  = 0x400
​
#---------------------------------------------------------------------------------------------------------
cdef Array_of_unsigned_int _cids_from_text( text ):
  """Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
  ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
  each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
  the array returned may be less than ``len( text )``."""
  #.........................................................................................................
  # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
  is_bytes = isinstance( text, ( bytes, bytearray, ) )
  assert is_bytes or isinstance( text, str ), '#121'
  #.........................................................................................................
  # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
  # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
  # text, there may be fewer CIDs if there happen to be surrogate characters):
  cdef unsigned int           length  = <unsigned int>len( text )
  cdef Array_of_unsigned_int  R       = Array_of_unsigned_int( length )
  #.........................................................................................................
  # If ``text`` is empty, we can return an empty array right away:
  if length == 0: return R
  #.........................................................................................................
  # Otherwise, prepare to copy data:
  cdef unsigned int idx               = 0
  #.........................................................................................................
  # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
  # values and are done:
  if is_bytes:
    for idx from 0 <= idx < length:
      R.data[ idx ] = <unsigned int>text[ idx ]
    return R
  #.........................................................................................................
  cdef unsigned int cid               = 0
  cdef bool         is_surrogate      = False
  cdef unsigned int hi                = 0
  cdef unsigned int lo                = 0
  cdef unsigned int chr_count         = 0
  #.........................................................................................................
  # Iterate over all indexes in text:
  for idx from 0 <= idx < length:
    #.......................................................................................................
    # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
    # corresponding low CID is on the current position. Having both, we can compute the intended CID
    # and reset the flag:
    if is_surrogate:
      lo = <unsigned int>ord( text[ idx ] )
      # IIRC, this formula was documented in Unicode 3:
      cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
              ( lo - _UMX_surrogate_lo_lower_bound )   _UMX_surrogate_lower_bound )
      is_surrogate = False
    #.......................................................................................................
    else:
      # Otherwise, we retrieve the CID from the current position:
      cid = <unsigned int>ord( text[ idx ] )
      #.....................................................................................................
      if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
        # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
        # in the next cycle:
        hi                = cid
        is_surrogate      = True
        continue
    #.......................................................................................................
    R.data[ chr_count ] = cid
    chr_count      = 1
  #................................................................................................

总结

  • 常见错误:主要涉及矩阵初始化、转置条件的边界检查以及转置操作的实现错误。
  • 更正:通过检查边界条件、确保字符的相邻性和正确处理转置,算法能够准确计算 Damerau-Levenshtein 编辑距离。

通过这种方式,算法不仅处理标准的编辑操作,还能优雅地处理相邻字符的转置操作。

0 人点赞