11 comments on “How to read an entire file into memory in C++”

  1. Nice article. Didn’t know about the problems with seeking to the end of the file.

    How does the admission of filesystem into TR2 change things? We could then find the file size portably and just read from beginning to end, no?
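
    In rough code, what I have in mind is something like this (just a sketch, written against the file_size() that eventually shipped as std::filesystem; read_whole_file is a made-up name and there is no error handling):

      #include <cstddef>
      #include <filesystem>
      #include <fstream>
      #include <string>

      // Hypothetical helper: ask the filesystem for the size up front,
      // then read exactly that many bytes in one go.
      std::string read_whole_file(const std::filesystem::path& p)
      {
          const auto size = std::filesystem::file_size(p);

          std::string contents(static_cast<std::size_t>(size), '\0');
          std::ifstream in(p, std::ios::binary);
          in.read(contents.data(), static_cast<std::streamsize>(size));

          // Assumes nothing changes the file in between and that binary mode is fine.
          return contents;
      }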

    • There is no TR2; I presume you mean the Filesystem TS that’s *almost* complete.

      I don’t think it helps at all, really. The first problem is whether the file_size() function actually does return the right size. The spec simply defers to POSIX and stat(), and I’m not a POSIX expert, so I can’t say for sure whether stat() gives you the size of the file on disk or the size of the file’s data… which may not be the same.

      But even if stat() does actually tell you the size of the file’s data, it still won’t help much. You’d have to watch out for race conditions where you get the file size, then open the file… and in the meantime the file’s been changed. You might think you could avoid this problem by opening the file then checking the file size… but it’s possible that you might be burned by race conditions even then if you’re reading from a cached version while the actual-on-disk version has changed.

      And even if *all* of that isn’t a problem, that method still won’t work in text mode. Windows line-ending translation turns two bytes (“\r\n”) into one… which means that if there are any newlines in the file (which seems likely), the size in memory will be less than the size of the file on disk. There’s also the DOS 0x1A gotcha, and who knows what other translations other platforms might do, including future ones. Even if you could reliably get the file size, it won’t help.

      So I’d say the problem remains, even if the Filesystem TS becomes part of some future standard.
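
      If you want to see the text-mode problem for yourself, a quick check along these lines makes it obvious (a sketch; it assumes a Windows machine, a file with “\r\n” line endings, and the file_size() that later landed in std::filesystem):

        #include <filesystem>
        #include <fstream>
        #include <iostream>
        #include <iterator>
        #include <string>

        int main()
        {
            const std::filesystem::path p = "example.txt";  // assumed to contain "\r\n" line endings

            // Size of the file on disk, in bytes.
            const auto disk_size = std::filesystem::file_size(p);

            // Read it in text mode: on Windows every "\r\n" pair arrives as a single '\n'.
            std::ifstream in(p);  // text mode by default
            const std::string contents((std::istreambuf_iterator<char>(in)),
                                       std::istreambuf_iterator<char>());

            std::cout << "on disk: " << disk_size << " bytes\n"
                      << "as read: " << contents.size() << " chars\n";
            // On Windows the second number is smaller by one per line ending,
            // so the on-disk size is not a safe amount to preallocate or read().
        }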

  2. I did like your article, but I’ve had some problems using your code from the Summary section.

    1 – Typos – look at line 1 of the first Summary sample code
    template ,
    typename Allocator = std::allocator>

    2 – Why not typename Traits = std::char_traits<Char> for all the templates?

    3 – The main problem is the deque example, which plainly does not work for file sizes that are not multiples of BUFSIZE: when in.read() does that last partial read, it bombs out of the while loop without container.insert() ever being called (sketched below).

    Using clang++ 3.5 / g++ 4.9.2 on Debian.
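
    The shape of the problem, as far as I can tell (a sketch of the pattern, not your exact code; BUFSIZE stands for whatever chunk size is used):

      #include <cstddef>
      #include <deque>
      #include <istream>

      constexpr std::size_t BUFSIZE = 4096;  // assumed chunk size

      std::deque<char> read_chunks_broken(std::istream& in)
      {
          std::deque<char> container;
          char buf[BUFSIZE];

          // Bug: read() fails on the final, partial chunk, so the loop exits
          // before that chunk's bytes are ever inserted into the container.
          while (in.read(buf, BUFSIZE))
              container.insert(container.end(), buf, buf + in.gcount());

          return container;
      }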

    • 1 – The sample code was not meant to be copy-pasted and used as is. It has all been transformed so many times to make it display properly on a WordPress blog with and without various syntax highlighting plugins – and corrupted so many times during those translations – that it’s almost certain that some errors have slipped by. I have had a hell of a time getting code to display properly, to the point that I’ve seriously considered abandoning WordPress altogether.

      I did fix the stray angle bracket in the summary sample though, thanks.

      2 – Because it’s unnecessary. Char and Traits both get automatically deduced from the stream you use. Allocator doesn’t unless you provide the allocator argument. If I didn’t make the allocator argument optional, the default for Allocator wouldn’t be necessary either… but that would make the function less easy to use.

      The only times I used default values for the function template arguments are when they can’t be deduced from the types, or when they follow other defaulted template arguments.
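
      To make that concrete, the signature is shaped something like this (a rough sketch, not the literal code from the article; read_stream_into_string is a stand-in name). Calling it with just a stream deduces Char and Traits; Allocator is only deduced if you actually pass an allocator, which is why it alone needs a default:

        #include <istream>
        #include <iterator>
        #include <memory>
        #include <string>

        template <typename Char, typename Traits,
                  typename Allocator = std::allocator<Char>>
        std::basic_string<Char, Traits, Allocator>
        read_stream_into_string(std::basic_istream<Char, Traits>& in,
                                Allocator alloc = Allocator())
        {
            // Char and Traits come from 'in'; Allocator comes from 'alloc' (or its default).
            return std::basic_string<Char, Traits, Allocator>(
                std::istreambuf_iterator<Char, Traits>(in),
                std::istreambuf_iterator<Char, Traits>(),
                alloc);
        }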

      3 – You are correct – I forgot the other half of the while condition. It must have gotten lopped off when I was chopping lines up to make them short enough to display.

      The samples have been corrected, thanks.
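
      For reference, the corrected loop is shaped roughly like this (a sketch of the pattern rather than the exact code from the post; BUFSIZE is whatever chunk size you pick):

        #include <cstddef>
        #include <deque>
        #include <istream>

        constexpr std::size_t BUFSIZE = 4096;  // assumed chunk size

        std::deque<char> read_chunks(std::istream& in)
        {
            std::deque<char> container;
            char buf[BUFSIZE];

            // The "other half" of the condition: gcount() is still non-zero after
            // the failed final read, so the partial chunk gets inserted too.
            while (in.read(buf, BUFSIZE) || in.gcount())
                container.insert(container.end(), buf, buf + in.gcount());

            return container;
        }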

  3. I tested this on Windows only and measured the times.
    No matter how big the file (~42 KB, ~30 MB & ~500 MB), the results were always the same:
    1. ignore-gcount() method
    2. read-chunks-deque method
    3. rdbuf()

    Always in this order (from fastest 1. to slowest 3.)
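
    For anyone reading without the article at hand, those labels presumably correspond to something like the following (rough sketches, not the article’s exact code; the read-chunks-deque loop is the one sketched in the reply above):

      #include <cstddef>
      #include <fstream>
      #include <limits>
      #include <sstream>
      #include <string>

      // 1. ignore-gcount(): find the length by ignoring every character,
      //    then seek back to the start and read it all in one call.
      std::string via_ignore_gcount(std::ifstream& in)
      {
          in.ignore(std::numeric_limits<std::streamsize>::max());
          const std::streamsize size = in.gcount();
          in.clear();                  // ignore() hit end-of-file, so reset the state
          in.seekg(0, std::ios::beg);

          std::string s(static_cast<std::size_t>(size), '\0');
          in.read(&s[0], size);
          return s;
      }

      // 3. rdbuf(): push the whole file buffer into a string stream.
      std::string via_rdbuf(std::ifstream& in)
      {
          std::ostringstream ss;
          ss << in.rdbuf();
          return ss.str();
      }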

  4. > (the operating system’s memory manager can more easily shuffle the chunks around if it needs to make more space).

    Eh, what? ;)

    What about memory mapping files?

    I admit to (ab)using fstat(fileno(f), &b) to get the size of a FILE*. If the size changes between getting it and reading, I simply return an error.
    I do wish standard C++ or Boost had a simple file_get function though, returning something like shared_array.
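
    Roughly what I do, for what it’s worth (a sketch; POSIX-only, error handling stripped down, and read_file is just a placeholder name):

      #include <stdio.h>     // FILE, fread, fgetc; fileno() is POSIX
      #include <sys/stat.h>  // POSIX fstat() and struct stat
      #include <vector>

      // Returns the contents of f, or an empty vector on error
      // (including the size changing between fstat() and the read).
      std::vector<char> read_file(FILE* f)
      {
          struct stat b;
          if (fstat(fileno(f), &b) != 0)
              return {};

          std::vector<char> data(static_cast<size_t>(b.st_size));
          if (fread(data.data(), 1, data.size(), f) != data.size()
              || fgetc(f) != EOF)  // shorter or longer than fstat() promised: bail out
              return {};

          return data;
      }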

  5. Regarding the observation about using ftell() in binary mode:

    “To understand why this is the case: Some platforms store files as fixed-size records. If the file is shorter than the record size, the rest of the block is padded.”

    Are you aware of any platform where this is the case?

    • I have never used any myself – I grew up in a world with POSIX – so I can’t give an authoritative answer. But I’ve heard that IBM’s CMS and MVS file systems are like this. I’ve also heard that OpenVMS is, too, though I’m less sure of this. Apparently a lot of ancient tape-based systems worked this way.

      I think CP/M was like this, too, and if you’re looking for a single villain to blame for *all* of this wackiness – the difference between text and binary files, and the fact that seeking to the end of a file may not actually seek to the end of a file – CP/M, from which MS-DOS sprang, is probably the one to point the finger at.
