HTML5 File API read as text and binary

I am currently working on the HTML5 File API, and I need to get binary file data. The FileReader's readAsText, and readAsDataURL methods work fine, but readAsBinaryString returns the same data as readAsText.

I need binary data, but im getting a text string. Am I missing something?

Answers:

Answer

Note in 2018: readAsBinaryString is outdated. For use cases where previously you'd have used it, these days you'd use readAsArrayBuffer (or in some cases, readAsDataURL) instead.


readAsBinaryString says that the data must be represented as a binary string, where:

...every byte is represented by an integer in the range [0..255].

JavaScript originally didn't have a "binary" type (until ECMAScript 5's WebGL support of Typed Array* (details below) -- it has been superseded by ECMAScript 2015's ArrayBuffer) and so they went with a String with the guarantee that no character stored in the String would be outside the range 0..255. (They could have gone with an array of Numbers instead, but they didn't; perhaps large Strings are more memory-efficient than large arrays of Numbers, since Numbers are floating-point.)

If you're reading a file that's mostly text in a western script (mostly English, for instance), then that string is going to look a lot like text. If you read a file with Unicode characters in it, you should notice a difference, since JavaScript strings are UTF-16** (details below) and so some characters will have values above 255, whereas a "binary string" according to the File API spec wouldn't have any values above 255 (you'd have two individual "characters" for the two bytes of the Unicode code point).

If you're reading a file that's not text at all (an image, perhaps), you'll probably still get a very similar result between readAsText and readAsBinaryString, but with readAsBinaryString you know that there won't be any attempt to interpret multi-byte sequences as characters. You don't know that if you use readAsText, because readAsText will use an encoding determination to try to figure out what the file's encoding is and then map it to JavaScript's UTF-16 strings.

You can see the effect if you create a file and store it in something other than ASCII or UTF-8. (In Windows you can do this via Notepad; the "Save As" as an encoding drop-down with "Unicode" on it, by which looking at the data they seem to mean UTF-16; I'm sure Mac OS and *nix editors have a similar feature.) Here's a page that dumps the result of reading a file both ways:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Show File Data</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
</style>
<script type='text/javascript'>

    function loadFile() {
        var input, file, fr;

        if (typeof window.FileReader !== 'function') {
            bodyAppend("p", "The file API isn't supported on this browser yet.");
            return;
        }

        input = document.getElementById('fileinput');
        if (!input) {
            bodyAppend("p", "Um, couldn't find the fileinput element.");
        }
        else if (!input.files) {
            bodyAppend("p", "This browser doesn't seem to support the `files` property of file inputs.");
        }
        else if (!input.files[0]) {
            bodyAppend("p", "Please select a file before clicking 'Load'");
        }
        else {
            file = input.files[0];
            fr = new FileReader();
            fr.onload = receivedText;
            fr.readAsText(file);
        }

        function receivedText() {
            showResult(fr, "Text");

            fr = new FileReader();
            fr.onload = receivedBinary;
            fr.readAsBinaryString(file);
        }

        function receivedBinary() {
            showResult(fr, "Binary");
        }
    }

    function showResult(fr, label) {
        var markup, result, n, aByte, byteStr;

        markup = [];
        result = fr.result;
        for (n = 0; n < result.length; ++n) {
            aByte = result.charCodeAt(n);
            byteStr = aByte.toString(16);
            if (byteStr.length < 2) {
                byteStr = "0" + byteStr;
            }
            markup.push(byteStr);
        }
        bodyAppend("p", label + " (" + result.length + "):");
        bodyAppend("pre", markup.join(" "));
    }

    function bodyAppend(tagName, innerHTML) {
        var elm;

        elm = document.createElement(tagName);
        elm.innerHTML = innerHTML;
        document.body.appendChild(elm);
    }

</script>
</head>
<body>
<form action='#' onsubmit="return false;">
<input type='file' id='fileinput'>
<input type='button' id='btnLoad' value='Load' onclick='loadFile();'>
</form>
</body>
</html>

If I use that with a "Testing 1 2 3" file stored in UTF-16, here are the results I get:

Text (13):

54 65 73 74 69 6e 67 20 31 20 32 20 33

Binary (28):

ff fe 54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 31 00 20 00 32 00 20 00 33 00

As you can see, readAsText interpreted the characters and so I got 13 (the length of "Testing 1 2 3"), and readAsBinaryString didn't, and so I got 28 (the two-byte BOM plus two bytes for each character).


* XMLHttpRequest.response with responseType = "arraybuffer" is supported in HTML 5.

** "JavaScript strings are UTF-16" may seem like an odd statement; aren't they just Unicode? No, a JavaScript string is a series of UTF-16 code units; you see surrogate pairs as two individual JavaScript "characters" even though, in fact, the surrogate pair as a whole is just one character. See the link for details.

Tags

Recent Questions

Top Questions

Home Tags Terms of Service Privacy Policy DMCA Contact Us

©2020 All rights reserved.