While working with our data science team, I began to notice something peculiar about their code, which is a mix of Python and C/C++ libraries of the kind you find in IPython or Jupyter notebooks. The code responsible for getting data to where they can work on it almost always dwarfs the code that operates on the data itself. This overhead, which they rightly feel they should not have to worry about, is a drag on their overall productivity. They would be far better served if they could simply operate on the data rather than fooling around with finding, opening, reading, writing, and closing files, to take one example, let alone managing their program's memory when they delve into C and C++ libraries. When I raise this with the team, they say they simply accept the situation, because their two choices seem to be either a framework so high-level that performance suffers, or a system with all the lower-level knobs exposed that, while fast, is error-prone and entangles them in the system's plumbing. Surely there is a better middle ground somewhere?
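As a hypothetical illustration of that imbalance (the data and function are invented for this sketch, not taken from the team's code), notice how much of even a tiny analysis routine is plumbing rather than analysis:

```python
import csv
import io

# Invented CSV data standing in for a file on disk.
RAW = "temp\n20.1\n19.8\n21.3\n"

def mean_temperature(stream):
    # Plumbing: parse the container format, skip the header row,
    # convert strings to floats, guard against empty input.
    reader = csv.reader(stream)
    next(reader)  # discard the "temp" header
    values = [float(row[0]) for row in reader if row]
    if not values:
        raise ValueError("no data rows found")
    # The actual data operation is this single line:
    return sum(values) / len(values)

print(mean_temperature(io.StringIO(RAW)))
```

Roughly one line in ten here does the work the analyst cares about; in real notebooks the ratio is often worse once path handling, encodings, and error recovery enter the picture.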