..  _packing:

Using *reprozip*
****************

The *reprozip* component is responsible for packing an experiment, which is done in three steps: :ref:`tracing the experiment <packing-trace>`, :ref:`editing the configuration file <packing-config>` (if necessary), and :ref:`creating the reproducible package <packing-pack>`. Each of these steps is explained in more detail below. Please note that *reprozip* is only available for Linux distributions.

..  _packing-trace:

Tracing an Experiment
=====================

First, *reprozip* needs to trace the operating system calls used by the experiment, so as to identify all the necessary information for its future re-execution, such as binaries, files, library dependencies, and environment variables.

The following command is used to trace a command line, or a `run`, used by the experiment::

    $ reprozip trace <command-line>

where `<command-line>` is the command line to be traced, i.e., the program together with its arguments. By running this command, *reprozip* executes `<command-line>` and uses ``ptrace`` to trace all the system calls issued, storing them in an SQLite database.

If you run the command multiple times, *reprozip* might ask you if you want to continue with your current trace (append the new command-line to it) or replace it (throw away the previous command-line you traced). You can skip this prompt by using either the ``--continue`` or ``--overwrite`` flag, like this::

    $ reprozip trace --continue <command-line>

Note that the final bundle will be able to reproduce any of the runs, and files shared by multiple runs are only stored once.
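
As an illustration, a two-step experiment could be traced as two runs of the same trace. The script and file names below are hypothetical; only the ``--continue`` and ``--overwrite`` flags come from *reprozip* itself::

    $ reprozip trace ./preprocess.sh raw_data.csv
    $ reprozip trace --continue ./analyze.sh clean_data.csv

Using ``--overwrite`` instead of ``--continue`` on the second line would discard the first run and keep only the second.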

By default, if the operating system is based on Debian or RPM packages (e.g.: Ubuntu, CentOS, Fedora, ...), *reprozip* will also try to automatically identify the distribution packages from which the files come, using the available package manager of the system. This is useful to provide more detailed information about the dependencies, as well as to further help when reproducing the experiment. However, note that this identification runs after the experiment finishes and can take some time, depending on the number of file dependencies that the experiment has. To disable this feature, users may use the flag ``--dont-identify-packages``::

    $ reprozip trace --dont-identify-packages <command-line>

The database, together with a *configuration file* (see below), is placed in a directory named ``.reprozip-trace``, created in the directory where the ``reprozip trace`` command was issued.

..  _packing-config:

Editing the Configuration File
==============================

The configuration file, which can be found in ``.reprozip-trace/config.yml``, contains all the information necessary for creating the experiment bundle. This file is generated by the tracer and drives the packing step.

It is very likely that you won't need to modify this file, as the automatically-generated one should be sufficient to create a working bundle. However, in some cases, you may want to edit it prior to the creation of the package to add or remove files used by your experiment. This can be particularly useful, for instance, to remove big files that can be obtained elsewhere when reproducing the experiment, to keep the size of the package small, and also to remove sensitive information that the experiment may use. The configuration file can also be used to edit the main command line, to add or remove environment variables, and to edit information regarding input/output files.

..  _packing-config-general:

The first part of the configuration file gives general information with respect to the experiment and its runs, including command lines, environment variables, working directory, and machine information. Also, each run has a unique identifier (given by ``id``) that is consistently used for packing and unpacking purposes::

    # Run info
    version: <reprozip-version>
    runs:
    # Run 0
    - id: <run-id>
      architecture: <machine-architecture>
      argv: <command-line-arguments>
      binary: <command-line-binary>
      distribution: <linux-distribution>
      environ: <environment-variables>
      exitcode: <exit-code>
      gid: <group-id>
      hostname: <machine-hostname>
      system: <system-kernel>
      uid: <user-id>
      workingdir: <working-directory>

    # Run 1
    - id: ...
    ...

If necessary, users may change command line parameters by editing ``argv``, and add or remove environment variables by editing ``environ``. Users may also give a more meaningful and user-friendly identifier for a run by changing ``id``. Other attributes should not be changed in general.
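
For instance, a run entry edited to pass a different parameter, set an environment variable, and carry a friendlier identifier might look like the sketch below. All values are hypothetical, and the exact layout of ``argv`` and ``environ`` should be taken from your own generated file rather than from this example::

    runs:
    # Run 0
    - id: train-model
      argv: [python, train.py, --epochs, '20']
      environ:
        OMP_NUM_THREADS: '4'
        PATH: /usr/local/bin:/usr/bin:/bin
      # remaining attributes (architecture, binary, ...) left as generated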

..  _packing-config-inputoutput:

The next section contains information about input and output files, including their original paths and which runs read and/or wrote them. These are the files that `reprozip` identified as the main inputs or results of the experiment, which `reprounzip` will later be able to replace and extract from the experiment when reproducing it. You may add, remove, or edit these files in case `reprozip` fails to recognize any important information, as well as give meaningful names to them by editing ``name``::

    # Inputs are files that are only read by a run; reprounzip can replace these
    # files on demand to run the experiment with custom data.
    # Outputs are files that are generated by a run; reprounzip can extract these
    # files from the experiment on demand, for the user to examine.
    # The name field is the identifier the user will use to access these files.
    inputs_outputs:
      - name: <file-identifier>
        path: <path-to-file>
        read_by_runs: <run-ids>
        written_by_runs: <run-ids>
      - name: ...
      ...

Note that you can prevent `reprozip` from identifying which files are input or output by using the ``--dont-find-inputs-outputs`` flag in the ``reprozip trace`` command.

..  note:: To visualize the dataflow of the experiment, please refer to :ref:`graph`.

..  seealso:: :ref:`Why doesn’t 'reprozip' identify my input/output file? <file_id>`

..  _packing-config-files:

The next section in the configuration file lists all the files to be packed. If the software dependencies were identified by the package manager of the system during the ``reprozip trace`` command, they will be organized in software packages and listed under ``packages``; otherwise, file dependencies will be listed under ``other_files``::

    packages:
      - name: <package-name>
        version: <package-version>
        size: <package-size>
        packfiles: <include-package>
        files:
          # Total files used: <used-files-size>
          # Installed package size: <package-size>
          <files-list>
      - name: ...
      ...

    other_files:
      <files-list>

The attribute ``packfiles`` can be used to control whether a software package will be packed: its default value is `true`, but users may change it to `false` to inform *reprozip* that the corresponding software package should not be included. To remove a file that was not identified as part of a package, users can simply remove it from the list under ``other_files``.
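
As a sketch, excluding a package only requires flipping its ``packfiles`` value. The package name, version, size, and file below are hypothetical, and the layout simply follows the template above::

    packages:
      - name: "texlive-base"
        version: "2013.20140215-1"
        size: 52428800
        packfiles: false
        files:
          - "/usr/share/texlive/texmf-dist/web2c/texmf.cnf"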

..  warning::

    Note that if a software package is requested not to be included, the `reprounzip` component will try to install it from a package manager when unpacking the experiment. If the software version from the package manager is different from (and incompatible with) the one used by the experiment, the experiment may not be reproduced correctly.

..  seealso:: :ref:`Why does 'reprounzip run' fail with "no such file or directory" or similar? <nosuchfile>`

..  _packing-config-patterns:

Last, users may add file patterns under ``additional_patterns`` to include other files that they think will be useful for a future reproduction. As an example, the following would add everything under ``/etc/apache2/`` and all the Python files of all users from LXC containers (contrived example)::

    additional_patterns:
      - /etc/apache2/**
      - /var/lib/lxc/*/rootfs/home/**/*.py

Note that users can always reset the configuration file to its initial state by running the following command::

    $ reprozip reset

..  warning::

    When editing a configuration file, make sure your changes are as restrictive as possible, modifying only the necessary information. Removing important information and changing the structure of the file may cause issues while creating the bundle or unpacking the experiment.

..  _packing-pack:

Creating a Bundle
=================

After tracing all the runs from the experiment and optionally editing the configuration file, the experiment bundle can be created by using the following command::

    $ reprozip pack <bundle>

where `<bundle>` is the name given to the package. This command generates a ``.rpz`` file in the current directory, which can then be sent to others so that the experiment can be reproduced. For more information regarding the unpacking step, please see :ref:`unpacking`.

Note that, by using ``reprozip pack``, files will be copied from your environment to the package; as such, you should not change any file that the experiment used before packing it, otherwise the package will contain different files from the ones the experiment used when it was originally traced.

..  warning::

    Before sending your bundle to others, it is advisable to test it and ensure that the reproduction of the experiment works.
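
One way to do this, assuming the *reprounzip* component is installed on the same machine or on a similar one, is to unpack the bundle into a scratch directory and run it there. The bundle name and target path below are placeholders; :ref:`unpacking` describes the available unpackers and their options::

    $ reprounzip info my_experiment.rpz
    $ reprounzip directory setup my_experiment.rpz /tmp/my_experiment_test
    $ reprounzip directory run /tmp/my_experiment_test
    $ reprounzip directory destroy /tmp/my_experiment_test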

..  _packing-further:

Further Considerations
======================

Packing Multiple Command Lines
++++++++++++++++++++++++++++++

As mentioned before, ReproZip allows multiple runs (i.e., command lines) to be traced and included in the same bundle. Alternatively, users can create a simple **script** that runs all the command lines, and pass *that* to ``reprozip trace``. However, in this case, there will be no flexibility in choosing a single run to be reproduced, since the entire script will be re-executed.

Note that this flexibility has the caveat that users may reproduce the runs in a different order than the one originally used while tracing. If the order is important for the reproduction (e.g.: each run represents a step in a dataflow), please make sure to communicate the correct reproduction order to whoever wants to replicate the experiment. The order can also be recovered by running ``reprounzip graph``; please refer to :ref:`provenance-graph` for more information.

ReproZip can also combine multiple traces into a single one, in order to create a single bundle, using the ``reprozip combine`` command. The runs of each subsequent trace are simply appended in order.

Packing GUI and Interactive Tools
+++++++++++++++++++++++++++++++++

ReproZip is able to pack GUI tools. Additionally, there is no restriction in packing interactive experiments (i.e., experiments that require input from users). Note, however, that if entering something different can make the experiment load additional dependencies, the experiment will probably fail when reproduced on a different machine.

..  _packing-clientserv:

Capturing Connections to Servers
++++++++++++++++++++++++++++++++

When reproducing an experiment that communicates with a server, the experiment will try to connect to the same server, which may or may not fail depending on the status of the server at the moment of the reproduction. However, if the experiment uses a local server (e.g.: database) that the user has control over, this server can also be captured, together with the experiment, to ensure that the connection will succeed. Users should create a script to:

* start the server,
* execute the experiment, and
* stop the server,

and use *reprozip* to trace the script execution, rather than the experiment itself. In this way, ReproZip is able to capture the local server as well, which ensures that the server will be alive at the time of the reproduction.

For example, if you have a web app that uses MySQL and that runs until ``Ctrl+C`` is received, you can use the following script::

    #!/bin/sh

    if [ "$(id -u)" != 0 ]; then echo "This script needs to run as root so that it can execute MySQL" >&2; exit 1; fi

    # Start MySQL
    sudo -u mysql /usr/sbin/mysqld --pid-file=/run/mysqld/mysqld.pid &
    sleep 5

    # Don't exit the whole script on Ctrl+C
    trap ' ' INT

    # Execute actual experiment that uses the database
    ./manage.py runserver 0.0.0.0:8000

    trap - INT

    # Graceful shutdown
    /usr/bin/mysqladmin shutdown

Note the use of ``trap`` to avoid exiting the entire script when pressing ``Ctrl+C``, to make sure that the database gets shut down via the next command.

Excluding Sensitive and Third-Party Information
+++++++++++++++++++++++++++++++++++++++++++++++

ReproZip automatically tries to identify log and temporary files, removing them from the bundle, but the configuration file should be edited to remove any sensitive information that the experiment uses, or any third-party file/software that should not be distributed. Note that the ReproZip team is **not responsible** for personal and non-authorized files that may get distributed in a package; users should double-check the configuration file and their package before sending it to others.

Identifying Output Files
++++++++++++++++++++++++

The `reprozip` component tries to automatically identify the main output files generated by the experiment during the ``trace`` command to provide useful interfaces for users during the unpacking step. However, if the experiment creates unique names for its outputs every time it is executed (e.g.: names with current date and time), the *reprounzip* component will not be able to correctly detect these; it assumes that input and output files do not have their path names changed between different executions. In this case, handling output files will fail. It is recommended that users modify their experiment (or use a wrapper script) to generate a symbolic link (with a fixed name) that always points to the latest result, and use that as the output file's path in the configuration file (under the ``inputs_outputs`` section).
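
A minimal sketch of such a wrapper script is shown below. The ``results`` directory, the file names, and the ``echo`` line standing in for the real experiment are all hypothetical; the only point is the final ``ln -sf``, which re-points a fixed-name link at the newest result after each execution::

    #!/bin/sh
    # Hypothetical wrapper: run the experiment, then point a fixed-name
    # symbolic link at the newest timestamped result.
    set -e

    mkdir -p results

    # Stand-in for the real experiment, which writes a uniquely named file
    out="results/output-$(date +%Y%m%d-%H%M%S).csv"
    echo "value,42" > "$out"

    # -s: create a symbolic link; -f: replace the link left by a previous run.
    # The target is relative to the link's own directory.
    ln -sf "$(basename "$out")" results/latest.csv

With this wrapper, the path ``results/latest.csv`` stays the same across executions, so it can be used as the output file's path under ``inputs_outputs``.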