Using reprozip

The reprozip component is responsible for packing an experiment, which is done in three steps: tracing the experiment, editing the configuration file (if necessary), and creating the reproducible package. Each of these steps is explained in more detail below. Please note that reprozip is only available for Linux distributions.

Tracing an Experiment

First, reprozip needs to trace the operating system calls used by the experiment, so as to identify all the necessary information for its future re-execution, such as binaries, files, library dependencies, and environment variables.

The following command is used to trace a command line, or a run, used by the experiment:

$ reprozip trace <command-line>

where <command-line> is the command line to execute. By running this command, reprozip executes <command-line> and uses ptrace to trace all the system calls issued, storing them in an SQLite database.

If you run the command multiple times, reprozip might ask you if you want to continue with your current trace (append the new command-line to it) or replace it (throw away the previous command-line you traced). You can skip this prompt by using either the --continue or --overwrite flag, like this:

$ reprozip trace --continue <command-line>

Note that the final package will be able to reproduce any of the runs, and files shared by multiple runs are only stored once.
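As a rough illustration of tracing several runs into one trace, the sketch below wraps two trace calls in a function. The script names and data paths (./preprocess.sh, ./analyze.sh, data/...) are hypothetical placeholders, and the REPROZIP variable is an assumption introduced here only so that the commands can be printed (REPROZIP=echo) instead of executed; it is not a ReproZip setting.

```shell
#!/bin/sh
# Sketch: trace two steps of a hypothetical pipeline into a single trace.
# Set REPROZIP=echo to print the commands instead of running them.

trace_pipeline() {
    # First run: start a fresh trace, discarding any previous one.
    "${REPROZIP:-reprozip}" trace --overwrite ./preprocess.sh data/raw.csv
    # Second run: append to the trace created by the first call.
    "${REPROZIP:-reprozip}" trace --continue ./analyze.sh data/clean.csv
}
```

Calling trace_pipeline from the experiment directory records both runs without prompting, since each call states explicitly whether to overwrite or continue.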

By default, if the operating system is based on Debian or RPM packages (e.g.: Ubuntu, CentOS, Fedora, ...), reprozip will also try to automatically identify the distribution packages from which the files come, using the available package manager of the system. This is useful to provide more detailed information about the dependencies, as well as to further help when reproducing the experiment. However, note that the trace command can take some time doing that after the experiment finishes, depending on the number of file dependencies that the experiment has. To disable this feature, users may use the flag --dont-identify-packages:

$ reprozip trace --dont-identify-packages <command-line>

The database, together with a configuration file (see below), is placed in a directory named .reprozip-trace, created under the path where the reprozip trace command was issued.

Editing the Configuration File

The configuration file, which can be found in .reprozip-trace/config.yml, contains all the information necessary for creating the experiment package. This file is generated by the tracer and drives the packing step.

It is very likely that you won’t need to modify this file, as the automatically-generated one should be sufficient to create a working package. However, in some cases, you may want to edit it prior to the creation of the package to add or remove files used by your experiment. This can be particularly useful, for instance, to remove big files that can be obtained elsewhere when reproducing the experiment, to keep the size of the package small, and also to remove sensitive information that the experiment may use. The configuration file can also be used to edit the main command line, to add or remove environment variables, and to edit information regarding input/output files.

The first part of the configuration file gives general information with respect to the experiment and its runs, including command lines, environment variables, working directory, and machine information. Also, each run has a unique identifier (given by id) that is consistently used for packing and unpacking purposes:

# Run info
version: <reprozip-version>
runs:
# Run 0
- id: <run-id>
  architecture: <machine-architecture>
  argv: <command-line-arguments>
  binary: <command-line-binary>
  distribution: <linux-distribution>
  environ: <environment-variables>
  exitcode: <exit-code>
  gid: <group-id>
  hostname: <machine-hostname>
  system: <system-kernel>
  uid: <user-id>
  workingdir: <working-directory>

# Run 1
- id: ...
...

If necessary, users may change command line parameters by editing argv, and add or remove environment variables by editing environ. Users may also give a more meaningful and user-friendly identifier for a run by changing id. Other attributes should not be changed in general.

The next section contains information about input and output files, including their original paths and which runs read and/or wrote them. These are the files that reprozip identified as the main inputs or results of the experiment, which reprounzip will later be able to replace and extract from the experiment when reproducing it. You may add, remove, or edit these files in case reprozip fails to recognize any important information, as well as give meaningful names to them by editing name:

# Inputs are files that are only read by a run; reprounzip can replace these
# files on demand to run the experiment with custom data.
# Outputs are files that are generated by a run; reprounzip can extract these
# files from the experiment on demand, for the user to examine.
# The name field is the identifier the user will use to access these files.
inputs_outputs:
  - name: <file-identifier>
    path: <path-to-file>
    read_by_runs: <run-ids>
    written_by_runs: <run-ids>
  - name: ...
  ...

Note that you can prevent reprozip from identifying which files are input or output by using the --dont-find-inputs-outputs flag in the reprozip trace command.

Note

To visualize the dataflow of the experiment, please refer to Visualizing the Provenance Graph.

The next section in the configuration file lists all the files to be packed. If the software dependencies were identified by the package manager of the system during the reprozip trace command, they will be organized in software packages and listed under packages; otherwise, file dependencies will be listed under other_files:

packages:
  - name: <package-name>
    version: <package-version>
    size: <package-size>
    packfiles: <include-package>
    files:
      # Total files used: <used-files-size>
      # Installed package size: <package-size>
      <files-list>
  - name: ...
  ...

other_files:
  <files-list>

The attribute packfiles can be used to control whether a software package will be packed: its default value is true, but users may change it to false to inform reprozip that the corresponding software package should not be included. To remove a file that was not identified as part of a package, users can simply remove it from the list under other_files.

Warning

Note that if a software package is requested not to be included, the reprounzip component will try to install it from a package manager when unpacking the experiment. If the software version from the package manager is different from (and incompatible with) the one used by the experiment, the experiment may not be reproduced correctly.
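Before switching packfiles to false for some packages, it can help to see which ones are currently marked for inclusion. The helper below is a minimal sketch, not part of ReproZip: the function name list_packed_packages is made up for illustration, and it assumes the configuration keeps the layout shown above, with each package's packfiles attribute following its name.

```shell
#!/bin/sh
# Print the name of every package whose packfiles attribute is true.
# Assumes each entry has a "- name: ..." line followed, within the same
# entry, by a "packfiles: true|false" line. Defaults to the tracer's
# configuration path when no file is given.
list_packed_packages() {
    awk '
        $1 == "-" && $2 == "name:"         { name = $3 }
        $1 == "packfiles:" && $2 == "true" { print name }
    ' "${1:-.reprozip-trace/config.yml}"
}
```

Running list_packed_packages in the directory where the trace was made lists the packages that will be copied into the package; anything not listed would have to be installed by reprounzip from a package manager.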

Last, users may add file patterns under additional_patterns to include other files that they think will be useful for a future reproduction. As an example, the following would add everything under /etc/apache2/ and all the Python files of all users from LXC containers (contrived example):

additional_patterns:
  - /etc/apache2/**
  - /var/lib/lxc/*/rootfs/home/**/*.py

Note that users can always reset the configuration file to its initial state by running the following command:

$ reprozip reset

Warning

When editing a configuration file, make sure your changes are as restrictive as possible, modifying only the necessary information. Removing important information and changing the structure of the file may cause issues while creating the package or unpacking the experiment.

Creating a Package

After tracing all the runs from the experiment and optionally editing the configuration file, the experiment package can be created by using the following command:

$ reprozip pack <package-name>

where <package-name> is the name given to the package. This command generates a .rpz file in the current directory, which can then be sent to others so that the experiment can be reproduced. For more information regarding the unpacking step, please see Using reprounzip.

Note that, by using reprozip pack, files will be copied from your environment to the package; as such, you should not change any file that the experiment used before packing it, otherwise the package will contain different files from the ones the experiment used when it was originally traced.

Warning

Before sending your package to others, it is advisable to test it and ensure that the reproduction of the experiment works.
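One possible way to run such a check on the packing machine is sketched below. It assumes that reprounzip and its directory unpacker (covered in Using reprounzip) are installed; the function name, the package name, and the scratch directory are hypothetical, and the REPROZIP and REPROUNZIP variables exist only so the commands can be printed with echo rather than executed.

```shell
#!/bin/sh
# Sketch: pack the current trace, then unpack and re-run it locally.
# REPROZIP and REPROUNZIP default to the real tools; setting them to
# "echo" prints the commands instead of running them.
pack_and_try() {
    package=$1      # e.g. my_experiment.rpz (hypothetical name)
    workdir=$2      # scratch directory for the test unpack
    "${REPROZIP:-reprozip}" pack "$package" &&
    "${REPROUNZIP:-reprounzip}" directory setup "$package" "$workdir" &&
    "${REPROUNZIP:-reprounzip}" directory run "$workdir"
}
```

A call such as pack_and_try my_experiment.rpz /tmp/check stops at the first failing step. Because this runs on the machine where the experiment was traced, it is only a first sanity check; trying the package on a different machine gives a stronger guarantee.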

Further Considerations

Packing Multiple Command Lines

As mentioned before, ReproZip allows multiple runs (i.e., command lines) to be traced and included in the same package. Alternatively, users can create a simple script that runs all the command lines, and pass that to reprozip trace. However, in this case, there will be no flexibility in choosing a single run to be reproduced, since the entire script will be re-executed.

Note that this flexibility has the caveat that users may reproduce the runs in a different order than the one originally used while tracing. If the order is important for the reproduction (e.g.: each run represents a step in a dataflow), please make sure to communicate the correct reproduction order to whoever wants to replicate the experiment. The order can also be obtained by running reprounzip graph; please refer to Creating a Provenance Graph for more information.

ReproZip can also combine multiple traces into a single one, in order to create a single package, using the reprozip combine command. The runs of each subsequent trace are simply appended in order.

Packing GUI and Interactive Tools

ReproZip is able to pack GUI tools. Additionally, there is no restriction on packing interactive experiments (i.e., experiments that require input from users). Note, however, that only the dependencies used while tracing are packed: if entering different input during the reproduction makes the experiment load additional dependencies, the experiment will probably fail when reproduced on a different machine.

Capturing Connections to Servers

When reproducing an experiment that communicates with a server, the experiment will try to connect to the same server, which may or may not fail depending on the status of the server at the moment of the reproduction. However, if the experiment uses a local server (e.g.: database) that the user has control over, this server can also be captured, together with the experiment, to ensure that the connection will succeed. Users should create a script to:

  • start the server,
  • execute the experiment, and
  • stop the server,

and use reprozip to trace the script execution, rather than the experiment itself. In this way, ReproZip is able to capture the local server as well, which ensures that the server will be alive at the time of the reproduction.

For example, if you have a web app that uses PostgreSQL and that runs until Ctrl+C is received, you can use the following script:

#!/bin/sh

/etc/init.d/postgresql start        # Start PostgreSQL

trap ' ' INT                        # Don't exit the whole script on Ctrl+C
./manage.py runserver 0.0.0.0:8000
trap - INT

/etc/init.d/postgresql stop         # Stop PostgreSQL

Note the use of trap to avoid exiting the entire script when pressing Ctrl+C, to make sure that the database gets shut down by the next command.

Excluding Sensitive and Third-Party Information

ReproZip automatically tries to identify log and temporary files, removing them from the package, but the configuration file should be edited to remove any sensitive information that the experiment uses, or any third-party file/software that should not be distributed. Note that the ReproZip team is not responsible for personal and non-authorized files that may get distributed in a package; users should double-check the configuration file and their package before sending it to others.
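As a starting point for that double-check, the configuration file can be scanned for paths that commonly hold credentials. The sketch below is only a heuristic: the function name and the pattern list are illustrative assumptions, not a ReproZip feature, and a clean result does not prove that the package is free of sensitive data.

```shell
#!/bin/sh
# Sketch: flag configuration entries whose paths look sensitive.
# The pattern list is an illustrative assumption; extend it to match
# your own environment. Prints matching lines with their line numbers.
find_sensitive_entries() {
    grep -nE '\.ssh/|\.gnupg/|\.aws/|\.netrc|id_rsa|\.pem|password' \
        "${1:-.reprozip-trace/config.yml}"
}
```

Each reported line number points at an entry to review, and possibly remove, in the configuration file before running reprozip pack.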

Identifying Output Files

The reprozip component tries to automatically identify the main output files generated by the experiment during the trace command to provide useful interfaces for users during the unpacking step. However, if the experiment creates unique names for its outputs every time it is executed (e.g.: names with current date and time), the reprounzip component will not be able to correctly detect these; it assumes that input and output files do not have their path names changed between different executions. In this case, handling output files will fail. It is recommended that users modify their experiment (or use a wrapper script) to generate a symbolic link (with a fixed name) that always points to the latest result, and use that as the output file’s path in the configuration file (under the inputs_outputs section).
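A wrapper along the following lines is one way to provide such a fixed name. It is a minimal sketch: run_experiment is a hypothetical stand-in for the real experiment, and the file names (result_<timestamp>.txt, latest_result.txt) are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch of a wrapper that gives a timestamped output a stable name.
# run_experiment stands in for the real experiment (hypothetical); it
# writes its result to the path it is given.
run_experiment() {
    echo "result data" > "$1"
}

run_with_stable_output() {
    outdir=$1
    mkdir -p "$outdir"
    stamped="result_$(date +%Y%m%d_%H%M%S).txt"
    run_experiment "$outdir/$stamped"
    # Repoint the fixed-name link at the newest result. The link target
    # is relative, so it stays valid if the directory is moved.
    ln -sf "$stamped" "$outdir/latest_result.txt"
}
```

With this in place, the wrapper is what gets traced, and the path of latest_result.txt is the one to list under inputs_outputs.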