Reading bytes from a JavaScript string

I have a string containing binary data in JavaScript. Now I want to read, for example, an integer from it. So I get the first 4 characters, use charCodeAt, do some shifting, etc. to get an integer.

The problem is that strings in JavaScript are UTF-16 (instead of ASCII) and charCodeAt often returns values higher than 255.

The Mozilla reference states that "The first 128 Unicode code points are a direct match of the ASCII character encoding." (But what about byte values above 127?)

How can I convert the result of charCodeAt to an ASCII value? Or is there a better way to convert a string of four characters to a 4 byte integer?
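For illustration, the approach described in the question can be sketched like this. The function name is mine, and it assumes each character of the string really holds one byte, which, as the question points out, is not guaranteed for UTF-16 strings:

```javascript
// Sketch: read a big-endian 32-bit integer from a "binary" string.
// The & 0xFF mask silently discards the high byte of any char code
// above 255 -- which is exactly the problem the question describes.
function readInt32BE(str, offset) {
  return ((str.charCodeAt(offset)     & 0xFF) << 24) +
         ((str.charCodeAt(offset + 1) & 0xFF) << 16) +
         ((str.charCodeAt(offset + 2) & 0xFF) << 8) +
          (str.charCodeAt(offset + 3) & 0xFF);
}

readInt32BE("\x12\x34\x56\x78", 0).toString(16); // "12345678"
```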

Answers:

Answer

I believe that you can do this with relatively simple bit operations:

function stringToBytes ( str ) {
  var ch, st, re = [];
  for (var i = 0; i < str.length; i++ ) {
    ch = str.charCodeAt(i);  // get char 
    st = [];                 // set up "stack"
    do {
      st.push( ch & 0xFF );  // push byte to stack
      ch = ch >> 8;          // shift value down by 1 byte
    }  
    while ( ch );
    // add stack contents to result
    // done because chars have "wrong" endianness
    re = re.concat( st.reverse() );
  }
  // return an array of bytes
  return re;
}

stringToBytes( "A\u1242B\u4123C" );  // [65, 18, 66, 66, 65, 35, 67]

It should then be a simple matter to read the byte array as if it were memory, combining bytes into larger numbers:

function getIntAt ( arr, offs ) {
  return (arr[offs+0] << 24) +
         (arr[offs+1] << 16) +
         (arr[offs+2] << 8) +
          arr[offs+3];
}

function getWordAt ( arr, offs ) {
  return (arr[offs+0] << 8) +
          arr[offs+1];
}

'\\u' + getWordAt( stringToBytes( "A\u1242" ), 1 ).toString(16);  // "\u1242"
Answer

Borgar's answer seems correct.

Just wanted to clarify one point. JavaScript treats the operands of bitwise operations as 32-bit signed ints, where the first (left-most) bit is the sign bit. I.e.,

getIntAt([0x7f,0,0,0],0).toString(16)  //  "7f000000"

getIntAt([0x80,0,0,0],0).toString(16)  // "-80000000"

However, for octet-data processing (e.g. network streams), you usually want the unsigned int representation. This can be accomplished by appending the '>>> 0' (zero-fill right shift) operator, which coerces the result to an unsigned 32-bit integer.

function getUIntAt ( arr, offs ) {
  return (arr[offs+0] << 24) +
         (arr[offs+1] << 16) +
         (arr[offs+2] << 8) +
          arr[offs+3] >>> 0;
}

getUIntAt([0x80,0,0,0],0).toString(16)   // "80000000"
Answer

Here are two methods for encoding a UTF-8 string to a byte array and decoding it back.

var utf8 = {};

utf8.toByteArray = function(str) {
    var byteArray = [];
    for (var i = 0; i < str.length; i++)
        if (str.charCodeAt(i) <= 0x7F)
            byteArray.push(str.charCodeAt(i));
        else {
            var h = encodeURIComponent(str.charAt(i)).substr(1).split('%');
            for (var j = 0; j < h.length; j++)
                byteArray.push(parseInt(h[j], 16));
        }
    return byteArray;
};

utf8.parse = function(byteArray) {
    var str = '';
    for (var i = 0; i < byteArray.length; i++)
        str +=  byteArray[i] <= 0x7F?
                byteArray[i] === 0x25 ? "%25" : // %
                String.fromCharCode(byteArray[i]) :
                "%" + byteArray[i].toString(16).toUpperCase();
    return decodeURIComponent(str);
};

// sample
var str = "Да!";
var ba = utf8.toByteArray(str);
alert(ba);             // 208, 148, 208, 176, 33
alert(ba.length);      // 5
alert(utf8.parse(ba)); // Да!
Answer

While @Borgar answers the question correctly, his solution is pretty slow. It took me a while to track it down (I used his function somewhere in a larger project), so I thought I would share my insight.

I ended up with something like @Kadm's. It's not a few percent faster, it's something like 500 times faster (no exaggeration!). I wrote a little benchmark, so you can see it for yourself :)

function stringToBytesFaster ( str ) {
  var ch, st, re = [], j = 0;
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    if (ch < 127) {
      re[j++] = ch & 0xFF;
    } else {
      st = [];                 // clear stack
      do {
        st.push( ch & 0xFF );  // push byte to stack
        ch = ch >> 8;          // shift value down by 1 byte
      } while ( ch );
      // add stack contents to result
      // done because chars have "wrong" endianness
      st = st.reverse();
      for (var k = 0; k < st.length; ++k)
        re[j++] = st[k];
    }
  }
  // return an array of bytes
  return re;
}
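The benchmark mentioned above is not included in the answer; a minimal self-contained harness along these lines (both functions inlined, iteration count and timing method my own choices) can be used to compare the two approaches:

```javascript
// Borgar's original: builds a per-character stack, reverses it,
// and concatenates -- allocating new arrays on every iteration.
function stringToBytes(str) {
  var ch, st, re = [];
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    st = [];
    do { st.push(ch & 0xFF); ch = ch >> 8; } while (ch);
    re = re.concat(st.reverse());
  }
  return re;
}

// The optimized variant: writes bytes directly into one result
// array by index, with a fast path for single-byte characters.
function stringToBytesFaster(str) {
  var ch, st, re = [], j = 0;
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    if (ch < 128) {
      re[j++] = ch;
    } else {
      st = [];
      do { st.push(ch & 0xFF); ch = ch >> 8; } while (ch);
      for (var k = st.length - 1; k >= 0; k--) re[j++] = st[k];
    }
  }
  return re;
}

// Crude wall-clock timer for `runs` calls of fn(input).
function time(fn, input, runs) {
  var t0 = Date.now();
  for (var i = 0; i < runs; i++) fn(input);
  return Date.now() - t0;
}

var sample = "A\u1242B\u4123C".repeat(100);
console.log('concat/reverse:', time(stringToBytes, sample, 1000), 'ms');
console.log('indexed writes:', time(stringToBytesFaster, sample, 1000), 'ms');
```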
Answer

Borgar's solution works perfectly. In case you want a more concrete implementation, you may want to have a look at the BinaryReader class from vjeux (which, for the record, is based on the binary-parser class from Jonas Raoni Soares Silva).

Answer

How did you get the binary data into the string in the first place? How the binary data gets encoded into a string is an IMPORTANT consideration, and you need an answer to that question before you can proceed.

One way I know of to get binary data into a string is to use the XHR object and set it to expect UTF-16.

Once it's in UTF-16, you can retrieve 16-bit numbers from the string using "....".charCodeAt(0), which returns a number between 0 and 65535.

Then, if you like, you can convert that number into two numbers between 0 and 255 like this:

var leftByte = mynumber>>>8;
var rightByte = mynumber&255;
Answer

An improvement on Borgar's solution:

...
    do {
      st.unshift( ch & 0xFF ); // prepend byte to front of stack
      ch = ch >> 8;            // shift value down by 1 byte
    }
    while ( ch );
    // add stack contents to result
    // no reverse() needed, since bytes were prepended in order
    re = re.concat( st );
...
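For reference, here is how the fragment above slots into Borgar's function as a complete sketch (the name stringToBytes2 is mine, to distinguish it from the original):

```javascript
// Same behaviour as Borgar's stringToBytes, but unshift() builds the
// per-character byte list front-to-back, so no reverse() is needed.
function stringToBytes2(str) {
  var ch, st, re = [];
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    st = [];
    do {
      st.unshift(ch & 0xFF); // prepend byte to front of stack
      ch = ch >> 8;          // shift value down by 1 byte
    } while (ch);
    re = re.concat(st);
  }
  return re; // an array of bytes
}

stringToBytes2("A\u1242B\u4123C"); // [65, 18, 66, 66, 65, 35, 67]
```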
Answer

One nice and quick hack is to use a combination of encodeURI and unescape :

t=[]; 
for(s=unescape(encodeURI("zażółć gęślą jaźń")),i=0;i<s.length;++i)
  t.push(s.charCodeAt(i));
t

[122, 97, 197, 188, 195, 179, 197, 130, 196, 135, 32, 103, 196, 153, 197, 155, 108, 196, 133, 32, 106, 97, 197, 186, 197, 132]

Perhaps some explanation is necessary why the heck it works, so let me split it into steps:

 encodeURI("zażółć gęślą jaźń")

returns

 "za%C5%BC%C3%B3%C5%82%C4%87%20g%C4%99%C5%9Bl%C4%85%20ja%C5%BA%C5%84"

which -- if you look closely -- is the original string with all characters whose values are above 127 replaced by (possibly more than one) hexadecimal byte representations. For example, the letter "ż" became "%C5%BC". The fact is that encodeURI also escapes some regular ASCII characters, like spaces, but it does not matter. What matters is that at this point each byte of the original string is either represented verbatim (as is the case with "z", "a", "g", or "j") or as a percent-encoded sequence of bytes (as was the case with "ż", which was originally the two bytes 197 and 188 and got converted to %C5 and %BC).

Now, we apply unescape:

unescape("za%C5%BC%C3%B3%C5%82%C4%87%20g%C4%99%C5%9Bl%C4%85%20ja%C5%BA%C5%84")

which gives

"zaÅ¼Ã³ÅÄ gÄÅlÄ jaÅºÅ"

If you are not a native Polish speaker you might not notice that this result is in fact very different from the original "zażółć gęślą jaźń". For starters, it has a different number of characters :) You can certainly tell that these strange versions of the capital letter A do not belong to the standard ASCII set. In fact, this "Å" has the value 197 (which is exactly C5 in hexadecimal).

Now, if you are like me, you would ask yourself: wait a minute... if this is really a sequence of bytes with values 122, 97, 197, 188, and JS is really using UTF, then why do I see these "Å¼" characters, and not the original "ż"?

Well, the thing is (I believe) that this sequence 122, 97, 197, 188 (which we see when applying charCodeAt) is not a sequence of bytes, but a sequence of codes. The character "Å" has the code 197, but it is actually a two-byte sequence: C3 85.

So the trick works because unescape treats the numbers occurring in a percent-encoded string as codes, not as byte values -- or, to be more specific: unescape knows nothing about multi-byte characters. When it decodes bytes one by one, it handles values lower than 128 just fine, but not so well when they are above 127 -- in such cases unescape simply returns a character which happens to have a code equal to the requested byte value. This "bug" is actually a useful feature.
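The same trick also works in the opposite direction, again relying on the legacy escape/unescape pair: build a string whose char codes are the byte values, percent-escape it, and let decodeURIComponent reassemble the multi-byte UTF-8 sequences. A sketch (function name is mine):

```javascript
// Decode a UTF-8 byte array back into a JavaScript string.
// escape() turns each code 128-255 into a %XX sequence, which
// decodeURIComponent then interprets as UTF-8 bytes.
function bytesToString(bytes) {
  var s = '';
  for (var i = 0; i < bytes.length; i++)
    s += String.fromCharCode(bytes[i]);
  return decodeURIComponent(escape(s));
}

bytesToString([122, 97, 197, 188]); // "zaż"
```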

Answer

I'm going to assume for a second that your objective is to read arbitrary bytes from a string. My first suggestion would be to make your string representation a hexadecimal representation of the binary data.

You can read the values using conversions to numbers from hex:

var BITS_PER_BYTE = 8;
var BITS_PER_CHAR = 4;  // one hex digit encodes 4 bits

function readBytes(hexString, numBytes) {
    return Number( parseInt( hexString.substr(0, numBytes * (BITS_PER_BYTE/BITS_PER_CHAR) ), 16 ) );
}

function removeBytes(hexString, numBytes) {
    return hexString.substr( numBytes * (BITS_PER_BYTE/BITS_PER_CHAR) );
}

The functions can then be used to read whatever you want:

var hex = '4ef2c3382fd';
alert( 'We had: ' + hex );

var intVal = readBytes(hex,2);
alert( 'Two bytes: ' + intVal.toString(2) );

hex = removeBytes(hex,2);
alert( 'Now we have: ' + hex );

You can then interpret the byte string however you want.

Hope this helps! Cheers!
