The modern world runs on open-source software. A prime example is the Linux kernel, embedded in most machines connected to the Internet. Most machine-learning applications are also built on top of open-source libraries. Digisalix’s tools for structured document content extraction are no exception to that rule.
These foundational open-source libraries are flexible building blocks: one can tweak them at will as needs arise. Sharing the improvements moves the whole industry forwards as a community effort.
We at Digisalix recently contributed code to the PyTorch machine learning library, one of the tools we use. This is the story of a performance issue, with memory allocations waiting at the end of the rabbit hole. It all started when we noticed that a bilinear interpolation operation was unexpectedly slow when computing gradients through it.
Bilinear interpolation approximates the colour value at a location that falls between pixels in a digital image. The operation takes as input a batch of images and grids of locations where we’d like to get the interpolated values (see the grid_sample docs). In our case, the images are static (but possibly large) and we optimise the sampling locations with gradient descent.
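To make the setup concrete, here is a minimal sketch of the call in PyTorch. The tensor shapes and names are illustrative only, not our production code.

```python
import torch
import torch.nn.functional as F

# A batch of static images: (N, C, H, W). Sizes here are illustrative.
images = torch.rand(1, 3, 512, 512)

# Sampling locations in normalised [-1, 1] coordinates: (N, H_out, W_out, 2).
# These are the parameters we optimise with gradient descent.
grid = torch.zeros(1, 64, 64, 2, requires_grad=True)

# Bilinear interpolation of the images at the grid locations.
sampled = F.grid_sample(images, grid, mode="bilinear", align_corners=False)

# A toy loss; backward() populates grid.grad, which a gradient-descent
# step would then use to update the sampling locations.
sampled.mean().backward()
```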
We noticed that the gradient computation slowed down as the image size increased. Yet the performance of the operation should be mostly independent of the image size, as it only looks at a fixed-size pixel neighbourhood around each grid location.
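A rough way to see the symptom is to time the backward pass for increasing image sizes while only the grid requires gradients. The sketch below uses made-up sizes, CPU wall-clock timing, and a hypothetical helper name; on a PyTorch build without the fix, the per-iteration time grows with the image size even though the images never need gradients.

```python
import time
import torch
import torch.nn.functional as F

def time_grid_sample_backward(image_size, grid_points=64, repeats=10):
    """Rough wall-clock timing of a grid_sample forward + backward on CPU."""
    images = torch.rand(1, 3, image_size, image_size)  # static, no grad needed
    grid = (torch.rand(1, grid_points, grid_points, 2) * 2 - 1).requires_grad_()
    start = time.perf_counter()
    for _ in range(repeats):
        out = F.grid_sample(images, grid, mode="bilinear", align_corners=False)
        out.sum().backward()
        grid.grad = None
    return (time.perf_counter() - start) / repeats

for size in (256, 1024, 4096):
    print(f"{size} x {size}: {time_grid_sample_backward(size) * 1e3:.1f} ms per iteration")
```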
To pinpoint the issue within the PyTorch codebase, we used the py-spy profiler, which can also handle Python C++ extensions. Native code profiling support is crucial, as the core of PyTorch is implemented in C++ and CUDA. To our surprise, we found that most of the time was spent on creating a zero-filled array to hold the gradient with respect to the input images. It turned out that the input image gradients are always computed, even when they are not needed. Although the interpolation operation is fast by itself, the time to allocate memory for the gradients grows with the image size and starts to dominate the run time.
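For reference, a py-spy invocation along these lines samples both the Python and native frames into a flame graph; the script name here is just a placeholder.

```bash
# Sample both Python and native (C++/CUDA extension) frames into a flame graph.
# optimise_sampling_locations.py stands in for the actual training script.
py-spy record --native --output profile.svg -- python optimise_sampling_locations.py
```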
PyTorch has excellent built-in facilities for adding specialised code paths based on, for example, whether gradients of an operation are needed later in the computation. With guidance from the friendly PyTorch developers, we modified the C++ and CUDA code for bilinear sampling to skip the image gradients when appropriate. Presto! A few orders of magnitude faster bilinear sampling for our use case.
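The actual patch lives in the C++ and CUDA kernels, but the same mechanism is visible at the Python level in custom autograd functions via ctx.needs_input_grad. The toy sketch below is not the real grid_sample code; it only illustrates the idea of skipping a gradient that nothing downstream needs.

```python
import torch

class ToySample(torch.autograd.Function):
    """Toy op illustrating the mechanism, not the actual grid_sample kernels:
    the backward checks which inputs need gradients and skips the rest."""

    @staticmethod
    def forward(ctx, images, grid):
        ctx.save_for_backward(images, grid)
        # Stand-in computation for the real bilinear sampling.
        return images.mean() * grid.mean()

    @staticmethod
    def backward(ctx, grad_output):
        images, grid = ctx.saved_tensors
        grad_images = grad_grid = None
        # Only allocate the (potentially huge) image gradient when it is
        # actually needed downstream.
        if ctx.needs_input_grad[0]:
            grad_images = (grad_output * grid.mean() / images.numel()) * torch.ones_like(images)
        if ctx.needs_input_grad[1]:
            grad_grid = (grad_output * images.mean() / grid.numel()) * torch.ones_like(grid)
        return grad_images, grad_grid

images = torch.rand(1, 3, 4096, 4096)                  # static: no gradients needed
grid = torch.zeros(1, 64, 64, 2, requires_grad=True)   # optimised sampling locations
ToySample.apply(images, grid).backward()               # only grad_grid is computed
```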
For more details on the code changes, see the links to the pull requests in the GitHub issue. Our patch will likely be included in releases after 1.10.