How does file buffering work in Python?
When you open a file to read from it in binary mode, you'll get back a BufferedReader object:
>>> f = open("exercises.zip", mode="rb")
>>> f
<_io.BufferedReader name='exercises.zip'>
When you open a file to read from it in text mode, you'll get back a TextIOWrapper object:
>>> f = open("my_file.txt")
>>> f
<_io.TextIOWrapper name='my_file.txt' mode='r' encoding='UTF-8'>
But this TextIOWrapper object has a buffer attribute that points to a BufferedReader object:
>>> f.buffer
<_io.BufferedReader name='my_file.txt'>
So reading from a text file actually involves reading from a binary file under the hood.
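There's actually one more layer below that: the BufferedReader wraps a raw, unbuffered FileIO object. Here's a quick sketch (reusing my_file.txt from above; the exact reprs may vary slightly between Python versions):
>>> f = open("my_file.txt")
>>> f.buffer         # the buffered binary file under our text file
<_io.BufferedReader name='my_file.txt'>
>>> f.buffer.raw     # the raw, unbuffered file under that
<_io.FileIO name='my_file.txt' mode='rb' closefd=True>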
Looping over a text file will read that file line-by-line:
>>> with open("frankenstein.txt") as f:
... for n, line in enumerate(f, start=1):
... if "monster" in line.casefold():
... print(f"First appearance of monster on line {n}")
... break
...
First appearance of monster on line 1564
But Python isn't going to disk to read another line for each iteration of our loop, because that would be inefficient.
Instead, Python stores a buffer of bytes in the BufferedReader object that our text file wraps around, and then our text file keeps track of where it left off within that buffer.
If we consume just one line from our file and then ask the file for its current position (using its tell method), we'll see that it's not very far along:
>>> f = open("frankenstein.txt")
>>> f.readline()
'\ufeffThe Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley\n'
>>> f.tell()
89
But if we ask the buffer for its current position, it tells us that it's much further along:
>>> f.buffer.tell()
8192
If we then read more lines from the file, we would see that while our text file object's position moved, the binary file object that it wraps around didn't move:
>>> more_lines = [f.readline() for _ in range(5)]
>>> f.tell()
380
>>> f.buffer.tell()
8192
We'd only move that buffer position if we read a sufficient number of bytes from our file.
Here we're reading many more lines:
>>> even_more_lines = [f.readline() for _ in range(200)]
>>> f.buffer.tell()
16384
The buffer position moved once the data in our buffer was exhausted.
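As a quick sanity check (continuing the same session), if we read everything that's left, the buffer's position should land right at the end of the file:
>>> rest = f.read()      # read the rest of the file
>>> import os
>>> f.buffer.tell() == os.path.getsize("frankenstein.txt")
True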
Python's default buffer size is defined in the io module in a variable called DEFAULT_BUFFER_SIZE, and it's 8 kilobytes (8192 bytes):
>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192
Changing this value won't actually change how buffering works, though.
If you need to use a different buffer size for some strange reason, you'll need to manually disable buffering.
You can do this by opening your file in binary mode, and then using a BufferedReader object to specify a different buffer size:
>>> import io
>>> with open("frankenstein.txt", mode="rb", buffering=0) as raw:
... f = io.BufferedReader(raw, buffer_size=128)
... print(f.read(16))
... print("buffered file:", f.tell())
... print("raw file:", raw.tell())
...
b'\xef\xbb\xbfThe Project G'
buffered file: 16
raw file: 128
Our raw file advanced by our custom buffer size (128 bytes), but the buffered file wrapping around it reports a position of just 16 bytes: the amount we actually read.
Manually buffering files is a strange thing to do. We don't usually customize buffering while reading from files. And when we do so, it's pretty much only ever for binary files.
Python also buffers when writing to files. Here we have some code which writes a bunch of characters to a file in a loop:
>>> f = open("just_dots.txt", mode="wt")
>>> chars = 0
>>> for _ in range(9):
...     f.write("." * 2048)
...     chars += 2048
...     print(chars, "chars,", f.buffer.tell(), "buffer,", f.buffer.raw.tell(), "raw")
...
When we run this code, we'll see that the underlying buffer (f.buffer) and the raw file (f.buffer.raw) don't change their positions at first:
2048
2048 chars, 0 buffer, 0 raw
2048
4096 chars, 0 buffer, 0 raw
2048
6144 chars, 0 buffer, 0 raw
Their positions remain the same until we hit a certain character threshold (that is, until our buffer has filled up):
2048
8192 chars, 8192 buffer, 8192 raw
2048
10240 chars, 8192 buffer, 8192 raw
2048
12288 chars, 8192 buffer, 8192 raw
2048
14336 chars, 8192 buffer, 8192 raw
2048
16384 chars, 16384 buffer, 16384 raw
2048
18432 chars, 16384 buffer, 16384 raw
So when Python writes to a file, it doesn't actually write to disk until its buffer fills up and it has to.
We can disable that behavior by turning off buffering when we open our file for writing, and then wrapping the binary file in a TextIOWrapper with write_through enabled, which tells it not to do any buffering of its own either.
Here we've completely disabled buffering while writing to our file:
>>> import io
>>> buffer = open("just_dots.txt", mode="wb", buffering=0)
>>> f = io.TextIOWrapper(buffer, write_through=True)
>>> chars = 0
>>> for _ in range(9):
...     f.write("." * 2048)
...     chars += 2048
...     print(chars, "chars,", f.buffer.tell(), "buffer")
...
2048
2048 chars, 2048 buffer
2048
4096 chars, 4096 buffer
2048
6144 chars, 6144 buffer
2048
8192 chars, 8192 buffer
2048
10240 chars, 10240 buffer
2048
12288 chars, 12288 buffer
2048
14336 chars, 14336 buffer
2048
16384 chars, 16384 buffer
2048
18432 chars, 18432 buffer
Disabling buffering is usually a bad idea: writing to disk takes time, and writing to disk on every single write can slow down our code considerably.
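To get a rough sense of that cost, here's a sketch that times buffered writes against fully unbuffered ones using timeit (the timing_test.txt filename and loop counts are arbitrary choices, and the resulting numbers are omitted here since they vary by machine, but the unbuffered version should be noticeably slower):
>>> import io
>>> import timeit
>>> def write_buffered():
...     with open("timing_test.txt", mode="wt") as f:
...         for _ in range(10_000):
...             f.write(".")
...
>>> def write_unbuffered():
...     raw = open("timing_test.txt", mode="wb", buffering=0)
...     with io.TextIOWrapper(raw, write_through=True) as f:
...         for _ in range(10_000):
...             f.write(".")
...
>>> timeit.timeit(write_buffered, number=10)
>>> timeit.timeit(write_unbuffered, number=10)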
Usually, instead of disabling write buffering entirely, it's a better idea to use the flush method when you need to force a write to disk:
>>> f = open("just_dots.txt", mode="wt")
>>> chars = 0
>>> for _ in range(9):
...     f.write("." * 2048)
...     f.flush()
...     chars += 2048
...     print(chars, "chars,", f.buffer.tell(), "buffer,", f.buffer.raw.tell(), "raw")
...
2048
2048 chars, 2048 buffer, 2048 raw
2048
4096 chars, 4096 buffer, 4096 raw
2048
6144 chars, 6144 buffer, 6144 raw
2048
8192 chars, 8192 buffer, 8192 raw
2048
10240 chars, 10240 buffer, 10240 raw
2048
12288 chars, 12288 buffer, 12288 raw
2048
14336 chars, 14336 buffer, 14336 raw
2048
16384 chars, 16384 buffer, 16384 raw
2048
18432 chars, 18432 buffer, 18432 raw
The flush method tells Python to write the contents of the current buffer to disk immediately.
The file flush method isn't used very often because Python automatically flushes files when they're closed.
Here we're using a with block to make sure our file is closed as soon as we're done working with it:
>>> with open("hello.txt", mode="wt") as f:
... for _ in range(3):
... f.write("Hello world\n")
...
12
12
12
When using a with block, Python will close the file (and flush it) as soon as we're done working with it.
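If we then read hello.txt back, all three lines should be there on disk:
>>> open("hello.txt").read()
'Hello world\nHello world\nHello world\n'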