一区二区三区日韩精品-日韩经典一区二区三区-五月激情综合丁香婷婷-欧美精品中文字幕专区

分享

[筆記]Delphi實現(xiàn)獲取字符串相似度

 npkaida 2016-03-21

維基百科對字符串相似度(Damerau–Levenshtein distance)的定義是:

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or atransposition of two adjacent characters. In his seminal paper[1], Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance, i.e., dealing with multiple edit operations, known as the Levenshtein distance, was introduced by Levenshtein,[2] but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distanceis sometimes used in some sources as to take into account non-adjacent transpositions or not.

簡單翻譯下,兩個字符序列的DL距離,就是從一個變換到另一個的最小操作次數(shù)。這個變換包括插入一個字符、刪除一個字符、替換一個字符、或互換兩個相鄰字符。

而所謂“編輯距離(edit distance,或叫Levenshtein distance)”,并不包含互換兩個相鄰字符。

主要應(yīng)用是在字符拼寫檢查上,當(dāng)然也可以用在其他地方,比方不少輸入法就提供類似的校正功能(搜狗拼音輸入法即實現(xiàn)了此功能)。

看起來簡單,實現(xiàn)還是有一定困難的,好在有牛人已經(jīng)做好相應(yīng)的函數(shù),如 KambizHow to match two strings approximately 中提供了兩個函數(shù):

計算DL距離的函數(shù)DamerauLevenshteinDistance(Str1, Str2)

function DamerauLevenshteinDistance(const Str1, Str2: string): Integer;

var

  LenStr1, LenStr2: Integer;

  I, J, T, Cost, Minimum: Integer;

  pStr1, pStr2, S1, S2: PChar;

  D, RowPrv2, RowPrv1, RowCur, Temp: PIntegerArray;

begin

  LenStr1 := Length(Str1);

  LenStr2 := Length(Str2);



  // to save some space, make sure the second index points to the shorter string

  if LenStr1 < LenStr2 then begin

    T := LenStr1;

    LenStr1 := LenStr2;

    LenStr2 := T;

    pStr1 := PChar(Str2);

    pStr2 := PChar(Str1);

  end

  else begin

    pStr1 := PChar(Str1);

    pStr2 := PChar(Str2);

  end;



  // to save some time and space, look for exact match

  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin

    Inc(pStr1);

    Inc(pStr2);

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  // when one string is empty, length of the other is the distance

  if LenStr2 = 0 then begin

    Result := LenStr1;

    Exit;

  end;



  // calculate the edit distance

  T := LenStr2 + 1;

  GetMem(D, 3 * T * SizeOf(Integer));

  FillChar(D^, 2 * T * SizeOf(Integer), 0);

  RowCur := D;

  RowPrv1 := @D[T];

  RowPrv2 := @D[2 * T];

  S1 := pStr1;



  for I := 1 to LenStr1 do begin

    Temp := RowPrv2;

    RowPrv2 := RowPrv1;

    RowPrv1 := RowCur;

    RowCur := Temp;

    RowCur[0] := I;

    S2 := pStr2;



    for J := 1 to LenStr2 do begin

      Cost := Ord(S1^ <> S2^);

      Minimum := RowPrv1[J - 1] + Cost;                 // substitution

      T := RowCur[J - 1] + 1;                           // insertion



      if T < Minimum then Minimum := T;



      T := RowPrv1[J] + 1;                              // deletion



      if T < Minimum then Minimum := T;



      if (I <> 1) and (J <> 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^) then begin

        T := RowPrv2[J - 2] + Cost;                     // transposition



        if T < Minimum then Minimum := T;

      end;



      RowCur[J] := Minimum;

      Inc(S2);

    end;



    Inc(S1);

  end;



  Result := RowCur[LenStr2];

  FreeMem(D);

end;

還有計算字符串相似度的函數(shù) StringSimilarityRatio(Str1, Str2, IgnoreCase): Double;

返回值在0到1之間,0表示不相似,1表示完全相似。

function StringSimilarityRatio(const Str1, Str2: string; IgnoreCase: Boolean): Double;

var

  MaxLen: Integer;

  Distance: Integer;

begin

  Result := 1.0;



  if Length(Str1) > Length(Str2) then

    MaxLen := Length(Str1)

  else

    MaxLen := Length(Str2);



  if MaxLen <> 0 then begin

    if IgnoreCase then

      Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))

    else

      Distance := DamerauLevenshteinDistance(Str1, Str2);



    Result := Result - (Distance / MaxLen);

  end;

end;

后來data man 參考一個德國人的ApproxStrUtils單元(該單元計算的是L距離,不是DL距離)給出一個據(jù)說效率更高的DL距離函數(shù),遺憾的是調(diào)用它會有“Invalid Pointer Operation”的報錯,還沒有Debug出問題所在,所以暫時先用前一個版本吧。

function DamerauLevenshteinDistance2(const Str1, Str2: string): Integer;

  function Min(const A, B, C: Integer): Integer; inline;

  begin

    Result := A;

    if B < A then

      Result := B;

    if C < Result then

      Result := C;

  end;



var

  LenStr1, LenStr2: Integer;

  I, J, T, Cost, PrevCost: Integer;

  pStr1, pStr2, S1, S2: PChar;

  D: PIntegerArray;

begin

  LenStr1 := Length(Str1);

  LenStr2 := Length(Str2);



  // to save some space, make sure the second index points to the shorter string

  if LenStr1 < LenStr2 then begin

    T := LenStr1;

    LenStr1 := LenStr2;

    LenStr2 := T;

    pStr1 := PChar(Str2);

    pStr2 := PChar(Str1);

  end

  else begin

    pStr1 := PChar(Str1);

    pStr2 := PChar(Str2);

  end;



  // to save some time and space, look for exact match

  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin

    Inc(pStr1);

    Inc(pStr2);

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  if LenStr2 = 0 then begin

    Result := LenStr1;

    Exit;

  end;



  // calculate the edit distance

  T := LenStr2 + 1;

  GetMem(D, T * SizeOf(Integer));



  for I := 0 to T do D[I] := I;



  S1 := pStr1;

  for I := 1 to LenStr1 do begin

    PrevCost := I - 1;

    Cost := I;

    S2 := pStr2;



    for J := 1 to LenStr2 do begin

      if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then

        Cost := PrevCost

      else

        Cost := 1 + min(Cost, PrevCost, D[J]);



      PrevCost := D[J];

      D[J] := Cost;

      Inc(S2);

    end;



    Inc(S1);

  end;



  Result := D[LenStr2];

  FreeMem(D);

end;

參考文獻:

  1. Damerau–Levenshtein_distance
    http://en./wiki/Damerau%E2%80%93Levenshtein_distance
  2. How to match two strings approximately
    http://www./articles/how-to-match-two-strings-approximately/
  3. Fuzzy string matching
    www./articles/how-to-match-two-strings-approximately
  4. Fuzzy search in strings
    http://www./approxstrutils-en.html
Technorati 標(biāo)簽: , , , , , ,

[筆記]Delphi實現(xiàn)獲取字符串相似度

    本站是提供個人知識管理的網(wǎng)絡(luò)存儲空間,所有內(nèi)容均由用戶發(fā)布,不代表本站觀點。請注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息,謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容,請點擊一鍵舉報。
    轉(zhuǎn)藏 分享 獻花(0

    0條評論

    發(fā)表

    請遵守用戶 評論公約

    類似文章 更多

    日本高清一区免费不卡| 东京热男人的天堂一二三区| 国产精品成人一区二区三区夜夜夜 | 好吊日成人免费视频公开| 91插插插外国一区二区婷婷| 视频一区二区 国产精品| 少妇人妻精品一区二区三区| 人妻少妇久久中文字幕久久| 日韩一区二区三区久久| 视频一区中文字幕日韩| 亚洲伦片免费偷拍一区| 日本乱论一区二区三区| 九九热精品视频免费观看| 99久久精品午夜一区二| 中国日韩一级黄色大片| 久久青青草原中文字幕| 激情综合网俺也狠狠地| 国产精品内射婷婷一级二级| 欧美日本亚欧在线观看| 国产精品国三级国产专不卡| 久久99这里只精品热在线| 亚洲最新中文字幕在线视频| 精品国产亚洲一区二区三区| 国产精品国三级国产专不卡| 亚洲一区二区三区在线免费 | 大香蕉网国产在线观看av| 欧洲一区二区三区蜜桃| 国产精品一区二区不卡中文| 欧美一区日韩一区日韩一区| 欧美日韩欧美国产另类| 中文字幕久热精品视频在线| 日韩性生活视频免费在线观看 | 欧美欧美日韩综合一区| 一级片二级片欧美日韩| 黄色污污在线免费观看| 91欧美一区二区三区成人| 欧美精品女同一区二区| 欧美日韩国产一级91| 国产精品白丝一区二区| 亚洲欧美一二区日韩高清在线| 色哟哟在线免费一区二区三区|