Previously, the _BlocksOutputBuffer code creates a list of bytes objects to handle the output data from compression libraries. This ends up being slow due to the output buffer code needing to copy each bytes element of the list into the final bytes object buffer at the end of compression.
The new PyBytesWriter API introduced in PEP 782 is an ergonomic and fast method of writing data into a buffer that will later turn into a bytes object. Benchmarks show that using the PyBytesWriter API is 10-30% faster for decompression across a variety of settings. The performance gains are greatest when the decompressor is very performant, such as for Zstandard (and likely zlib-ng). Otherwise the decompressor can bottleneck decompression and the gains are more modest, but still sizable (e.g. 10% faster for zlib)!
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
The following types are now immutable:
* `_curses_panel.panel`,
* `[posix,nt].ScandirIterator`, `[posix,nt].DirEntry` (exposed in `os.py`),
* `_remote_debugging.RemoteUnwinder`,
* `_tkinter.Tcl_Obj`, `_tkinter.tkapp`, `_tkinter.tktimertoken`,
* `zlib.Compress`, and `zlib.Decompress`.
Fix warnings when using -Wimplicit-fallthrough compiler flag.
Annotate explicitly "fall through" switch cases with a new
_Py_FALLTHROUGH macro which uses __attribute__((fallthrough)) if
available. Replace "fall through" comments with _Py_FALLTHROUGH.
Add _Py__has_attribute() macro. No longer define __has_attribute()
macro if it's not defined. Move also _Py__has_builtin() at the top
of pyport.h.
Co-Authored-By: Nikita Sobolev <mail@sobolevn.me>
This PR adds the ability to enable the GIL if it was disabled at
interpreter startup, and modifies the multi-phase module initialization
path to enable the GIL when loading a module, unless that module's spec
includes a slot indicating it can run safely without the GIL.
PEP 703 called the constant for the slot `Py_mod_gil_not_used`; I went
with `Py_MOD_GIL_NOT_USED` for consistency with gh-104148.
A warning will be issued up to once per interpreter for the first
GIL-using module that is loaded. If `-v` is given, a shorter message
will be printed to stderr every time a GIL-using module is loaded
(including the first one that issues a warning).
Work around a macOS bug, limit zlib crc32 calls to 1GiB.
Without this, `zlib.crc32` and `binascii.crc32` could produce incorrect
results on multi-gigabyte inputs depending on the macOS version's Apple
supplied zlib implementation.
* pycore_intrinsics.h does nothing if included twice
(add #ifndef and #define).
* Update Tools/cases_generator/generate_cases.py to generate the
Py_BUILD_CORE test.
* _bz2, _lzma, _opcode and zlib extensions now define the
Py_BUILD_CORE_MODULE macro to use internal headers
(pycore_code.h, pycore_intrinsics.h and pycore_blocks_output_buffer.h).
Here we are doing no more than adding the value for Py_mod_multiple_interpreters and using it for stdlib modules. We will start checking for it in gh-104206 (once PyInterpreterState.ceval.own_gil is added in gh-104204).
Change summary:
+ There is now a `gzip.READ_BUFFER_SIZE` constant that is 128KB. Other programs that read in 128KB chunks: pigz and cat. So this seems best practice among good programs. Also it is faster than 8 kb chunks.
+ a zlib._ZlibDecompressor was added. This is the _bz2.BZ2Decompressor ported to zlib. Since the zlib.Decompress object is better for in-memory decompression, the _ZlibDecompressor is hidden. It only makes sense in file decompression, and that is already implemented now in the gzip library. No need to bother the users with this.
+ The ZlibDecompressor uses the older Cpython arrange_output_buffer functions, as those are faster and more appropriate for the use case.
+ GzipFile.read has been optimized. There is no longer a `unconsumed_tail` member to write back to padded file. This is instead handled by the ZlibDecompressor itself, which has an internal buffer. `_add_read_data` has been inlined, as it was just two calls.
EDIT: While I am adding improvements anyway, I figured I could add another one-liner optimization now to the python -m gzip application. That read chunks in io.DEFAULT_BUFFER_SIZE previously, but has been updated now to use READ_BUFFER_SIZE chunks.
When compiled with `USE_ZLIB_CRC32` defined (`configure` sets this on POSIX systems), `binascii.crc32(...)` failed to compute the correct value when the input data was >= 4GiB. Because the zlib crc32 API is limited to a 32-bit length.
This lines it up with the `zlib.crc32(...)` implementation that doesn't have that flaw.
**Performance:** This also adopts the same GIL releasing for larger inputs logic that `zlib.crc32` has, and causes the Windows build to always use zlib's crc32 instead of our slow C code as zlib is a required build dependency on Windows.
Clarifies a versionchanged note on crc32 & adler32 docs that the workaround is only needed for Python 2 and earlier.
Also cleans up an unnecessary intermediate variable in the implementation.
Authored-By: Ma Lin / animalize
Co-authored-by: Gregory P. Smith <greg@krypto.org>
* zlib uses an UINT32_MAX sliding window for the output buffer
These funtions have an initial output buffer size parameter:
- zlib.decompress(data, /, wbits=MAX_WBITS, bufsize=DEF_BUF_SIZE)
- zlib.Decompress.flush([length])
If the initial size > UINT32_MAX, use an UINT32_MAX sliding window, instead of clamping to UINT32_MAX.
Speed up when (the initial size == the actual size).
This fixes a memory consumption and copying performance regression in earlier 3.10 beta releases if someone used an output buffer larger than 4GiB with zlib.decompress.
Reviewed-by: Gregory P. Smith
* Fix initial buffer size can't > UINT32_MAX in zlib module
After commit f9bedb630e, in 64-bit build,
if the initial buffer size > UINT32_MAX, ValueError will be raised.
These two functions are affected:
1. zlib.decompress(data, /, wbits=MAX_WBITS, bufsize=DEF_BUF_SIZE)
2. zlib.Decompress.flush([length])
This commit re-allows the size > UINT32_MAX.
* adds curly braces per PEP 7.
* Renames `Buffer_*` to `OutputBuffer_*` for clarity
Faster bz2/lzma/zlib via new output buffering.
Also adds .readall() function to _compression.DecompressReader class
to take best advantage of this in the consume-all-output at once scenario.
Often a 5-20% speedup in common scenarios due to less data copying.
Contributed by Ma Lin.
No longer use deprecated aliases to functions:
* Replace PyObject_MALLOC() with PyObject_Malloc()
* Replace PyObject_REALLOC() with PyObject_Realloc()
* Replace PyObject_FREE() with PyObject_Free()
* Replace PyObject_Del() with PyObject_Free()
* Replace PyObject_DEL() with PyObject_Free()
* Fix potential division by zero in BZ2_Malloc()
* Avoid division by zero in PyLzma_Malloc()
* Avoid division by zero and integer overflow in PyZlib_Malloc()
Reported by Svace static analyzer.