Using automatic persistent memoization to facilitate data analysis scripting
- 17 July 2011
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 287-297
- https://doi.org/10.1145/2001420.2001455
Abstract
Programmers across a wide range of disciplines (e.g., bioinformatics, neuroscience, econometrics, finance, data mining, information retrieval, machine learning) write scripts to parse, transform, process, and extract insights from data. To speed up iteration times, they split their analyses into stages and write extra code to save the intermediate results of each stage to files so that those results do not have to be re-computed in every subsequent run. As they explore and refine hypotheses, their scripts often create and process lots of intermediate data files. They need to properly manage the myriad of dependencies between their code and data files, or else their analyses will produce incorrect results. To enable programmers to iterate quickly without needing to manage intermediate data files, we added a set of dynamic analyses to the programming language interpreter so that it automatically memoizes (caches) the results of long-running pure function calls to disk, manages dependencies between code and on-disk data, and later re-uses memoized results, rather than re-executing those functions, when guaranteed safe to do so. We created an implementation for Python and show how it enables programmers to iterate faster on their data analysis scripts while writing less code and not having to manage dependencies between their code and datasets.
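The idea of persisting the results of long-running pure function calls to disk can be illustrated with a small sketch. This is not the interpreter-level implementation the abstract describes: it is a plain decorator, it assumes the decorated function is pure rather than checking purity, and it tracks only the function's own source text, not the data files or other functions it depends on. The names `persistent_memoize`, `CACHE_DIR`, `min_seconds`, and `cache_dir` are hypothetical and chosen for illustration only.

```python
import functools
import hashlib
import inspect
import os
import pickle
import tempfile
import time

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "memo_sketch_cache")


def persistent_memoize(min_seconds=0.0, cache_dir=CACHE_DIR):
    """Cache results of a (presumed pure) function on disk.

    Entries are keyed by the function's code and its arguments, so
    editing the function body invalidates its earlier entries.
    """
    def decorator(func):
        try:
            code_id = inspect.getsource(func)
        except (OSError, TypeError):
            # Source text unavailable (e.g. code run via exec):
            # fall back to the compiled bytecode.
            code_id = func.__code__.co_code.hex()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: arguments are picklable and pickle to the same
            # bytes on every run, which holds for simple values but is
            # not guaranteed for arbitrary objects.
            key_bytes = pickle.dumps(
                (func.__qualname__, code_id, args, sorted(kwargs.items()))
            )
            path = os.path.join(
                cache_dir, hashlib.sha256(key_bytes).hexdigest() + ".pkl"
            )
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # reuse the stored result

            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start

            # Only persist calls that ran long enough to be worth caching.
            if elapsed >= min_seconds:
                os.makedirs(cache_dir, exist_ok=True)
                with open(path, "wb") as f:
                    pickle.dump(result, f)
            return result

        return wrapper
    return decorator
```

Under these assumptions, decorating a slow parsing or aggregation stage with `@persistent_memoize(min_seconds=1.0)` would let a later run of the script load that stage's result from disk instead of recomputing it. The sketch still leaves to the programmer what the paper's approach automates: deciding which functions are safe to cache and noticing when an input file or a called function has changed.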
Funding Information
- Air Force Research Laboratory (FA8650-10-C-7024)