..  _develop-plugins:

Developer's Guide
*****************

General Development Information
-------------------------------

Development happens on `GitHub <https://github.com/VIDA-NYU/reprozip>`__; bug reports and feature requests are welcome. If you are interested in giving us a hand, please do not hesitate to submit a pull request there.

Continuous testing is provided by `GitHub Actions <https://github.com/VIDA-NYU/reprozip/actions>`__. Note that ReproZip still tries to support Python 2 as well as Python 3. Test coverage is not very high because many operations are difficult to exercise on CI (for instance, Vagrant VMs cannot be used there).

If you have any questions or need help with the development of an unpacker or plugin, please use our development mailing-list at `reprozip@nyu.edu <https://groups.google.com/a/nyu.edu/g/reprozip>`__.

Introduction to ReproZip
------------------------

ReproZip works in two steps: tracing and packing. Under the hood, tracing itself consists of two separate steps, which gives the following three-stage workflow:

* Running the experiment under trace. During this part, the experiment is running, and the ``_pytracer`` C extension watches it through the `ptrace` mechanism, recording information in the trace SQLite3 database (``.reprozip-trace/trace.sqlite3``). This database contains raw information as it is recorded and does little else, leaving that to the next step. This part is referred to as the "C tracer".
* After the experiment is done, some additional information is computed by the Python code to generate the configuration file, by looking at the trace database and the filesystem. For example, all accesses to a file are aggregated to decide whether it is read or written by the overall experiment and whether it is an input or output file, symlinks are resolved, etc. Additional information is also written, such as OS information and the distribution package each file comes from.
* Packing reads the configuration file to create the ``.rpz`` bundle, which includes a configuration file (re-written into a "canonical" version), the trace database (though it is not read at this step), and the files listed in the configuration (which the user may have edited).

It is therefore important to note that the configuration file and the trace database contain distinct information: although the configuration is inferred from the database, it contains some additional details that were obtained from the original machine afterwards.

Only the configuration file should be necessary to run unpackers. The trace database is included for information, and to support additional commands like ``reprounzip graph`` (:ref:`graph`).
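
The aggregation performed in the second step of the workflow above can be pictured with a minimal sketch. This is not ReproZip's actual code: the ``READ`` and ``WRITE`` flags, the ``aggregate_accesses`` function, and the "first access is a read" heuristic are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical access-mode flags; the real trace database may encode
# access modes differently.
READ = 0x01
WRITE = 0x02


def aggregate_accesses(events):
    """Combine individual accesses into one summary per file.

    `events` is an iterable of (path, mode) pairs, in the order in which
    the accesses happened.
    """
    combined = defaultdict(int)
    first_mode = {}
    for path, mode in events:
        combined[path] |= mode             # union of every mode seen
        first_mode.setdefault(path, mode)  # remember the first access only

    summary = {}
    for path, mode in combined.items():
        summary[path] = {
            'read': bool(mode & READ),
            'written': bool(mode & WRITE),
            # Assumed heuristic: a file whose first access is a read
            # existed before the experiment, so it may be an input.
            'possible_input': bool(first_mode[path] & READ),
        }
    return summary
```

With this sketch, a file that is first written and later read back is reported as both read and written, but not as a possible input.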

Writing Unpackers
-----------------

Using ReproZip involves two steps. The first is packing, which gives a generic package containing the trace SQLite database, the YAML configuration file (which lists the paths, packages, and metadata such as command line, environment variables, and input/output files), and the actual files. In the second step, a package can be run using *reprounzip*. This decoupling allows the reproducer to select the unpacker of their choice, and also means that when a new unpacker is released, users will be able to use it on their old packages.

Currently, several unpackers are maintained: the default ones (``directory`` and ``chroot``), ``vagrant`` (distributed as `reprounzip-vagrant <https://pypi.org/project/reprounzip-vagrant/>`__) and ``docker`` (distributed as `reprounzip-docker <https://pypi.org/project/reprounzip-docker/>`__). However, the interface is such that new unpackers can be easily added. While taking a look at the "official" unpackers' source is probably a good idea, this page gives some useful information about how they work.

ReproZip Bundle Format (``.rpz``)
'''''''''''''''''''''''''''''''''

An ``.rpz`` file is a ``tar.gz`` archive that contains a directory ``METADATA``, which contains meta-information from *reprozip*, and an archive ``DATA.tar.gz``, which contains the actual files that were packed and that will be unpacked to the target directory for reproducing the experiment.

The ``METADATA/version`` file marks the file as a ReproZip bundle. In current bundles it contains the string ``REPROZIP VERSION 2``. Bundles created before version 0.8 (2015) contained ``REPROZIP VERSION 1`` instead; in that format, ``DATA`` was a directory rather than a ``tar.gz`` file.
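
As a rough illustration of the layout described above, the version marker can be read with the standard :mod:`tarfile` module. The function name ``read_bundle_version`` is hypothetical and not part of *reprounzip*; the sketch only assumes that the bundle is a tar archive with a ``METADATA/version`` member.

```python
import tarfile


def read_bundle_version(rpz_path):
    """Return the contents of METADATA/version from a .rpz bundle."""
    # 'r:*' lets tarfile detect the compression on its own
    with tarfile.open(rpz_path, 'r:*') as bundle:
        # Raises KeyError if the archive has no such member
        member = bundle.extractfile('METADATA/version')
        if member is None:
            raise ValueError("METADATA/version is not a regular file")
        return member.read().decode('ascii').strip()
```

An unpacker could compare the returned string against the markers listed above and refuse bundles it does not understand.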

The ``METADATA/config.yml`` file is in the same format as the configuration file generated by *reprozip*, but without the ``additional_patterns`` section (at this point, it has already been expanded to the actual list of files while packing).

The ``METADATA/trace.sqlite3`` file is the original trace generated by the C tracer and maintained in a SQLite database; it contains all the information about the experiment, in case the configuration file is insufficient in some aspect. This file is used, for instance, by the *graph* unpacker, so that it can recover the exact hierarchy of processes, together with the executable images they execute and the files they access (with the time and mode of these accesses).

..  seealso:: :ref:`Trace Database Schema <trace-schema>`
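
Because the trace is a regular SQLite database, it can be explored with the standard :mod:`sqlite3` module. The sketch below makes no assumption about the schema: it only lists the tables that are present, using SQLite's built-in ``sqlite_master`` catalog. The function name ``list_trace_tables`` is hypothetical.

```python
import sqlite3


def list_trace_tables(db_path):
    """Return the names of the tables found in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling it on a ``trace.sqlite3`` file extracted from a bundle is a quick way to check the database against the schema documentation before writing queries.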

Structure
'''''''''

An unpacker is a Python module. It can be distributed separately or be a part of a bigger distribution, provided that it is declared in that distribution's ``setup.py`` as an `entry_point` to be registered with `pkg_resources` (see `setuptools' advertising behavior section <https://setuptools.pypa.io/en/latest/userguide/entry_point.html#advertising-behavior>`__). You should declare a function under the ``reprounzip.unpackers`` `entry_point` group. The name of the entry_point (before ``=``) will be the *reprounzip* subcommand, and the value is a callable that will get called with the :class:`argparse.ArgumentParser` object for that subcommand.

The package :mod:`reprounzip.unpackers` is a namespace package, so you should be able to add your own unpackers there if you want to. Please remember to put the correct code in the ``__init__.py`` file (which you can copy from `here <https://github.com/VIDA-NYU/reprozip/blob/master/reprounzip/reprounzip/unpackers/__init__.py>`__) so that namespace packages work correctly.

The modules :mod:`reprounzip.common`, :mod:`reprounzip.utils`, and :mod:`reprounzip.unpackers.common` contain utilities that you might want to use (make sure to list *reprounzip* as a requirement in your ``setup.py``).

Example of ``setup.py``::

    setup(name='reprounzip-vagrant',
          namespace_packages=['reprounzip', 'reprounzip.unpackers'],
          install_requires=['reprounzip>=0.4'],
          entry_points={
              'reprounzip.unpackers': [
                  'vagrant = reprounzip.unpackers.vagrant:setup'
                  # The setup() function sets up the parser for reprounzip vagrant
              ]
          }
          # ...
    )

Usual Commands
''''''''''''''

If possible, you should try to follow the same command names that the official unpackers use, which are:

* ``setup``: to create the experiment directory and set everything for execution;
* ``run``: to reproduce the experiment;
* ``destroy``: to bring down everything that ``setup`` created and safely delete the experiment directory;
* ``upload`` and ``download``: to replace input files in the experiment, and to get the output files for further examination, respectively.

If these commands can be broken down into different steps that you want to expose to the user, or if you provide completely different actions from these defaults, you can add them to the parser as well. For instance, the *vagrant* unpacker exposes ``setup/start``, which starts or resumes the virtual machine, and ``destroy/vm``, which stops and deallocates the virtual machine but leaves the template for possible reuse.

A Note on File Paths
''''''''''''''''''''

ReproZip supports Python 2 and 3, is portable to different operating systems, and is meant to accept a wide variety of configurations so that it is compatible with most experiments out there. Even trickier, `reprounzip-vagrant` needs to manipulate POSIX filenames on Windows, for example when the unpacker runs on a Windows host.
Therefore, the `rpaths <https://github.com/remram44/rpaths>`__ library is used everywhere internally. You should make sure to use the correct type of path (either :class:`~rpaths.PosixPath` or :class:`~rpaths.Path`) and to cast these to the type that Python functions expect, keeping in mind the differences between Python 2 and 3 (most likely either ``filename.path`` or ``str(filename)``).
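
The distinction between "paths on the machine running the unpacker" and "POSIX paths inside the experiment" can be illustrated with the standard library. Note that this sketch deliberately uses :mod:`pathlib` (Python 3 only) instead of `rpaths`, so it is an analogy rather than the code ReproZip uses; the ``path_in_experiment`` helper and the example paths are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_in_experiment(root, *parts):
    """Build a path inside the experiment, always using POSIX rules."""
    return PurePosixPath(root).joinpath(*parts)


# A path inside the packed experiment: POSIX on every host
guest_file = path_in_experiment('/home/user', 'experiment', 'input.txt')

# A path on a Windows host: a different type, so the two cannot be mixed
# up silently
host_dir = PureWindowsPath(r'C:\Users\me\mydirectory')

# Convert explicitly at the boundary where a plain string is expected
guest_str = str(guest_file)            # '/home/user/experiment/input.txt'
host_str = str(host_dir / 'config.yml')
```

Keeping the two kinds of paths in separate types, and converting only at the point where a string is required, is the same discipline the paragraph above asks for with `rpaths`.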

Experiment Directory Format
'''''''''''''''''''''''''''

Unpackers usually create a directory with everything necessary to later run the experiment. This directory is created by the ``setup`` operation, cleaned up by ``destroy``, and is the argument to every command. For example, with `reprounzip-vagrant`::

    $ reprounzip vagrant setup someexperiment.rpz mydirectory
    $ reprounzip vagrant upload mydirectory /tmp/replace.txt:input_text

Unpackers unpack the ``config.yml`` file to the root of that directory, and keep status information in a ``.reprounzip`` file, which is a dict in :mod:`pickle` format. Please try to follow the same structure: it allows the ``showfiles`` command, as well as the :class:`~reprounzip.unpackers.common.FileUploader` and :class:`~reprounzip.unpackers.common.FileDownloader` classes, to work correctly.
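
A minimal sketch of reading and writing such a status file with the standard :mod:`pickle` module follows. The helper names, the choice of pickle protocol 2 (readable from both Python 2 and 3), and the keys stored in the dict are assumptions made for illustration; :mod:`reprounzip.unpackers.common` may already provide helpers for this, so check there first.

```python
import os
import pickle

STATUS_FILE = '.reprounzip'


def write_status(target, status):
    """Store the status dict at the root of the experiment directory."""
    with open(os.path.join(target, STATUS_FILE), 'wb') as fp:
        # Protocol 2 can be read by both Python 2 and Python 3
        pickle.dump(status, fp, protocol=2)


def read_status(target):
    """Load the status dict previously stored by write_status()."""
    with open(os.path.join(target, STATUS_FILE), 'rb') as fp:
        return pickle.load(fp)
```

For example, a hypothetical unpacker might call ``write_status('mydirectory', {'unpacker': 'example', 'state': 'created'})`` at the end of ``setup`` and ``read_status('mydirectory')`` at the start of every other command.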

Signals
'''''''

Since version 0.4.1, `reprounzip` has signals that can be used to hook in plugins, although no such plugin has been released at this time. To ensure that these work correctly when using your unpacker, you should emit them when appropriate. The complete list of signals is available in `signals.py <https://github.com/VIDA-NYU/reprozip/blob/master/reprounzip/reprounzip/signals.py>`__.
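
The general idea of emitting signals around an operation can be sketched as follows. This is a generic publish/subscribe illustration, not the API of the `reprounzip` signals module: the ``Signal`` class, its ``subscribe`` and ``emit`` methods, and the ``pre_run``/``post_run`` names are all hypothetical, so refer to ``signals.py`` for the real names and arguments.

```python
class Signal(object):
    """A tiny publish/subscribe helper (illustrative only)."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, func):
        self._listeners.append(func)

    def emit(self, **kwargs):
        # Iterate over a copy so listeners may subscribe during emission
        for func in list(self._listeners):
            func(**kwargs)


# Hypothetical signals surrounding the "run" command
pre_run = Signal()
post_run = Signal()


def run_experiment(target):
    """Emit a signal before and after the (omitted) actual work."""
    pre_run.emit(target=target)
    retcode = 0  # a real unpacker would execute the experiment here
    post_run.emit(target=target, retcode=retcode)
    return retcode
```

A plugin would subscribe a function to the signals it cares about, and the unpacker only has to remember to emit them at the right moments.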

Final Observations
------------------

After reading this page, reading the source code of one of the "official" unpackers is probably the best way of understanding how to write your own. They should be short enough to be easy to grasp. Should you have additional questions, do not hesitate to use our mailing-list: `reprozip@nyu.edu`.