Shortening a UTF-8 string using UTF-32 encoding in JavaScript?

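I'm trying to find a way to compress/decompress a string in JavaScript. By "compress" I mean making the string look shorter (fewer characters). That's my goal.

Here's an example of how things should work: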

// The string that I want to make shorter
// It will only contain [a-zA-Z0-9] chars and some punctuation like ()[]{}.,;'"!
var string = "I like bananas !";

// The compressed string, maybe something like "䐓㐛꯱字",
// which is shorter than the original
var shortString = compress(string);  

// The original string, "I like bananas !"
var originalString = decompress(shortString);

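Here's my first idea (there may be a better way to reach my goal; if so, I'm interested).

I know my original string will be UTF-8, so I'm thinking of using UTF-32 for the encoding, which should divide the length of the string by 4.

But I don't know how to write these two functions so that they build the new string with a different encoding. Here's the code I have so far, which doesn't work...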

function compress(string) {
    string = unescape(encodeURIComponent(string));
    var newString = '';

    for (var i = 0; i < string.length; i++) {
        var char = string.charCodeAt(i);
        newString += parseInt(char, 8).toString(32);
    }

    return newString;
}
Answer

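Since you're using a set of fewer than 100 characters and JavaScript strings are encoded in UTF-16 (which means each code unit has 65536 possible values), you can concatenate the character codes so that you get one "compressed" character for every two basic characters. This lets you compress a string to half its length.

Like this, for example: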

document.getElementById('compressBtn').addEventListener('click', function() {
  var stringToCompress = document.getElementById('tocompress').value;
  var compressedString = compress(stringToCompress);
  var decompressedString = decompress(compressedString);

  if (stringToCompress === decompressedString) {
    document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
  } else {
    document.getElementById('display').innerHTML = "This string cannot be compressed"
  }

})


function compress(string) {
  string = unescape(encodeURIComponent(string));
  var newString = '',
    char, nextChar, combinedCharCode;

  for (var i = 0; i < string.length; i += 2) {
    char = string.charCodeAt(i);

    if ((i + 1) < string.length) {

      // Subtract 31 so the second code fits in two digits (32..130 becomes 1..99); with three
      // digits the combined code could exceed 65535. Limitation: character codes below 32
      // (control characters) or above 130 are not handled reliably.
      nextChar = string.charCodeAt(i + 1) - 31;

      // this is to pad the result, because you could have a code that is single digit, which would make 
      // decompression a bit harder
      combinedCharCode = char + "" + nextChar.toLocaleString('en', {
        minimumIntegerDigits: 2
      });

      // You take the concatenated code string and convert it back to a number, then a character
      newString += String.fromCharCode(parseInt(combinedCharCode, 10));

    } else {

      // The string length is not always even, so the last character may have no partner
      newString += string.charAt(i);
    }
  }
  return newString;
}

function decompress(string) {

  var newString = '',
    char, codeStr, firstCharCode, lastCharCode;

  for (var i = 0; i < string.length; i++) {
    char = string.charCodeAt(i);
    if (char > 132) {
      codeStr = char.toString(10);

      // You take the first part of the compressed char code, it's your first letter
      firstCharCode = parseInt(codeStr.substring(0, codeStr.length - 2), 10);

      // For the second one you need to add 31 back.
      lastCharCode = parseInt(codeStr.substring(codeStr.length - 2, codeStr.length), 10) + 31;

      // You put back the 2 characters you had originally
      newString += String.fromCharCode(firstCharCode) + String.fromCharCode(lastCharCode);
    } else {
      newString += string.charAt(i);
    }
  }
  return newString;
}

var stringToCompress = 'I like bananas!';
var compressedString = compress(stringToCompress);
var decompressedString = decompress(compressedString);

document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
body {
  padding: 10px;
}

#tocompress {
  width: 200px;
}
<input id="tocompress" placeholder="enter string to compress" />
<button id="compressBtn">
  Compress input
</button>
<div id="display">

</div>
Another answer

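If your strings contain only ASCII characters [0, 127], you can "compress" the string using a custom 6-bit or 7-bit code page.

You can do this in several ways, but I think one of the simpler ones is to define an array holding all the allowed characters (a LUT, or lookup table, if you like) and then use each character's index as its encoded value. You will of course have to mask and shift the encoded values into a typed array manually.

If your LUT looked like this: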

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

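In this case you would be dealing with a LUT of length 72, which means we need 7 bits, a range of [0, 127] (if the length were at most 64, we could reduce it to 6-bit values, [0, 63]).

Then you would take each character in the string and convert it to an index value (you would normally do all the steps below in a single operation, but I have separated them for simplicity):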

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";
var str = "I like bananas !";
var page = [];

Array.prototype.forEach.call(str, function(ch) {
  var i = lut.indexOf(ch);
  if (i < 0) throw "Invalid character - can't encode";
  page.push(i);
});

console.log("Intermediate page:", page);
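
The "mask and shift" step mentioned above is not shown in the answer. The following is a minimal sketch of one way it could look, assuming the same `lut` string. The helper names `pack` and `unpack`, the `BITS` constant, and the choice to pass the character count separately are illustrative assumptions, not part of the original answer.

```javascript
// Hypothetical helpers (not from the original answer): pack LUT indices
// into a Uint8Array using a fixed number of bits per character.
var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";
var BITS = Math.ceil(Math.log2(lut.length)); // bits needed per index (7 for this LUT)

function pack(str) {
  var bytes = new Uint8Array(Math.ceil(str.length * BITS / 8));
  var acc = 0, accBits = 0, pos = 0;
  for (var n = 0; n < str.length; n++) {
    var i = lut.indexOf(str[n]);
    if (i < 0) throw new Error("Invalid character - can't encode: " + str[n]);
    acc = (acc << BITS) | i;   // shift the new index into the accumulator
    accBits += BITS;
    while (accBits >= 8) {     // flush every complete byte
      accBits -= 8;
      bytes[pos++] = (acc >> accBits) & 0xFF;
    }
    acc &= (1 << accBits) - 1; // keep only the bits not yet written
  }
  // Left-align any leftover bits in the final byte
  if (accBits > 0) bytes[pos] = (acc << (8 - accBits)) & 0xFF;
  return bytes;
}

// `count` is the number of characters originally packed. It has to be stored
// alongside the bytes, because trailing padding bits are otherwise ambiguous.
function unpack(bytes, count) {
  var out = "", acc = 0, accBits = 0, pos = 0;
  for (var n = 0; n < count; n++) {
    while (accBits < BITS) {   // pull in bytes until one full index is available
      acc = (acc << 8) | bytes[pos++];
      accBits += 8;
    }
    accBits -= BITS;
    out += lut[(acc >> accBits) & ((1 << BITS) - 1)];
    acc &= (1 << accBits) - 1;
  }
  return out;
}

var packed = pack("I like bananas !");
console.log(packed.length, unpack(packed, 16));
```

With 7 bits per character, the 16-character example occupies 14 bytes instead of 16. The saving is in bytes, not in visible characters, so this helps storage or transfer size rather than how the string looks on screen.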
Another answer

See the full code here: https://repl.it/NyMl/1

With a Uint8Array you can work with bytes.

let msg = "This is some message";

let data = []

for(let i = 0; i < msg.length; ++i){
  data[i] = msg.charCodeAt(i);
}

let i8 = new Uint8Array(data);
// Viewing the same buffer as 16-bit units requires an even byte length (true here: 20 bytes)
let i16 = new Uint16Array(i8.buffer);
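
The snippet above stops at the `Uint16Array` view. As a rough sketch of how that view could become a shorter string and back, assuming ASCII-only input: the names `packAscii` and `unpackAscii` are hypothetical, and the zero padding byte for odd lengths is an assumption of this sketch, not something the answer specifies.

```javascript
// Hypothetical helpers: store two ASCII characters in each UTF-16 code unit.
function packAscii(msg) {
  // Pad to an even byte count, since a Uint16Array view needs a multiple of 2 bytes
  var i8 = new Uint8Array(msg.length + (msg.length % 2));
  for (var i = 0; i < msg.length; ++i) {
    var code = msg.charCodeAt(i);
    // NUL is reserved as the padding byte; codes above 127 are outside ASCII
    if (code < 1 || code > 127) throw new Error("Only non-NUL ASCII is supported");
    i8[i] = code;
  }
  // Reinterpret the same bytes as 16-bit units (byte order follows the platform)
  var i16 = new Uint16Array(i8.buffer);
  var out = "";
  for (var k = 0; k < i16.length; ++k) out += String.fromCharCode(i16[k]);
  return out;
}

function unpackAscii(packed) {
  var i16 = new Uint16Array(packed.length);
  for (var i = 0; i < packed.length; ++i) i16[i] = packed.charCodeAt(i);
  var i8 = new Uint8Array(i16.buffer);
  var out = "";
  for (var j = 0; j < i8.length; ++j) {
    if (i8[j] !== 0) out += String.fromCharCode(i8[j]); // skip the padding byte
  }
  return out;
}

var short = packAscii("This is some message");
console.log(short.length, unpackAscii(short));
```

Because both bytes of every pair are below 128, each resulting code unit stays below 0xD800, so no surrogate code units are produced. The packed string depends on the platform's byte order, so it should be unpacked on a machine with the same endianness.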

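You can also consider compression like this: http://pieroxy.net/blog/pages/lz-string/demo.html

If you don't want to use a third-party library, LZ-based compression should be fairly simple to implement. See here (Wikipedia).

Another answer

I use the same library mentioned above, lz-string (https://github.com/pieroxy/lz-string), and it creates file sizes smaller than most binary formats (such as Protocol Buffers).

I compress via Node.js like this: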

var compressedString = LZString.compressToUTF16(str);

I decompress on the client like this:

var decompressedString = LZString.decompressFromUTF16(str);
