
Training dataset dumper tutorial#

Introduction#

In this tutorial you will learn to use the training-dataset-dumper, an essential tool to extract information for the training and evaluation of flavour tagging algorithms from data files in the ATLAS event data model.

The motivation for using the training-dataset-dumper is to decouple the ATLAS analysis software from the algorithm development, which mainly relies on modern python tools that do not always interact well with the ATLAS and CERN ROOT environment.

The dataset dumper produces h5 files which store jet-related and track-related observables in arrays that can be processed with tools such as numpy or pandas, used as input for neural networks defined with TensorFlow or PyTorch, and visualised with matplotlib. The main use of the dumped h5 files (also called ntuples) is to provide input to the Umami framework for training and evaluation of flavour tagging algorithms.
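For illustration, a minimal sketch (assuming a dumper output file named output.h5 with a jets dataset, as produced in the tasks below) of loading the jet array with h5py and numpy could look like this:

from h5py import File
import numpy as np

# Sketch: load the dumped jets into a structured numpy array.
# The file name "output.h5" and the "jets"/"pt" names match the examples later in this tutorial.
with File("output.h5", "r") as h5file:
    jets = h5file["jets"][:]        # structured numpy array, one entry per jet
    print(jets.dtype.names)         # names of the stored jet variables
    print(np.mean(jets["pt"]))      # e.g. the mean jet pt (stored in MeV)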

In this tutorial you will learn how to:

  1. Fork, clone, and install the training-dataset-dumper.
  2. Run a test job and inspect the output.
  3. Modify the configuration files to disable the jet calibration and confirm the expected change in the output.
  4. Add/remove a jet variable in the configuration files and confirm the expected change in the output.
  5. Schedule a neural network to be evaluated while running the training-dataset-dumper and write the network scores to the output file.
  6. Open a merge request to the training-dataset-dumper GitLab project to fix an issue or improve the documentation.
  7. Run the training-dataset-dumper on the grid and retrieve the output file.

If you manage to do all these tasks, there are a few bonus exercises, prompting you to learn how to:

  1. Download a DxAOD file using rucio.
  2. Write a plotting script to display some variables stored in the h5 file using either ROOT or matplotlib.
  3. Write a plotting script to draw a ROC curve based on the scores in the h5 file of an evaluated flavour tagging algorithm.
  4. Change the track selection in the configuration files of the training-dataset-dumper, run with the modified selection and inspect the output.
  5. Manipulate h5 files, e.g. extract only one variable or only a few events from an h5 file.

The tutorial is meant to be followed in a self-guided manner. You will be prompted to do certain tasks by being told what the desired outcome should be, without being told how to achieve it. Using the documentation of the training-dataset-dumper, you can find out how to reach your goal. In case you are stuck, you can click on the "hint" toggle box to get a hint. If you have tried a problem for more than 10 minutes, feel free to also toggle the solution with a worked example.

In case you encounter some errors, please reach out on the training-dataset-dumper mattermost channel (click here to sign up) and open a merge request to fix the tutorial.

Prerequisites#

You need access to a shell on either CERN's lxplus or your local institute's machine with access to /cvmfs, so that you can set up the ATLAS software environment.

Alternatively, you can also run inside a container which provides the ATLAS software environment on your local computer. Below, instructions are provided for both cases. Please choose the appropriate one.

Prepare environment on lxplus

For following the tutorial session on lxplus, we recommend connecting via ssh lxplus.cern.ch. If you wish to work in your home directory, make sure that sufficient disk space is available. The installation of the dumper needs at least 200 MB of free disk space. You can check the used quota and available disk space with fs quota.

Before you start with the tutorial, make sure that you are using a recent version of git. On lxplus, you can do that by setting up a version with lsetup.

setupATLAS
lsetup git

In addition, you need to download a sample file which will be processed by the training-dataset-dumper. We recommend that you store it on your private EOS storage (see details here), because it has a size of 237 MB. You can retrieve it with the following commands:

cd /eos/user/${USER:0:1}/${USER}/
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
wget https://umami-ci-provider.web.cern.ch/tutorial/DAOD_FTAG1.ttbar_tutorial.root
Prepare environment on local machine without cvmfs access (e.g. your laptop)

In case you want to work on a machine without access to cvmfs (the CERN Virtual Machine File System which distributes the ATLAS software), you can still follow the tutorial using a Docker container.

If you haven't done so already, install Docker Desktop and follow the installation instructions below. Note that these differ a little from the setup shown in the solution to the first task.

# authenticate by logging in to the CERN GitLab container registry with your CERN username and password
docker login gitlab-registry.cern.ch

# download training-dataset-dumper image
docker pull gitlab-registry.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper:main

# check out training-dataset-dumper project
mkdir tdd && cd tdd
git clone ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git

In addition, you need to download a sample file which will be processed by the training-dataset-dumper. Note that it has a size of 237 MB and ensure that you have sufficient disk space. You can retrieve it with the following commands:

mkdir -p ftag_tutorial/data
cd ftag_tutorial/data
wget https://umami-ci-provider.web.cern.ch/tutorial/DAOD_FTAG1.ttbar_tutorial.root
cd -

Now you are ready to launch the docker container and compile the dataset dumper.

# start the docker container and mount the current directory inside the container
docker run --rm -it -v $PWD:/home/workdir --workdir /home/workdir gitlab-registry.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper:main

# compile code: no need to source a setup script with "asetup" inside of a docker container
mkdir build
cd build
cmake ../training-dataset-dumper
make
# add executables to system path
source x*/setup.sh
cd ..

Tutorial tasks#

1. Fork, clone, and install the training-dataset-dumper#

Before you can start with the other tasks, you need to do this one first. The expected outcome of this task is that you will have

  1. created a personal fork of the training-dataset-dumper GitLab project,
  2. cloned it to your work area on your machine using git,
  3. set up a development branch for the tutorial called my_tutorial_branch,
  4. successfully compiled it and set up the paths to be able to use it.

Go to the GitLab project page of the training-dataset-dumper to begin with the task: https://gitlab.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper/

Hint: how can I create a fork of a project?

In case you are stuck how to create your personal fork of the project, you can find some general information on git and the forking concept here in the GitLab documentation.

Hint: how can I clone and compile the project?

In case you are stuck and don't know what to do to retrieve the project code using git clone and how to compile it, have a look at the installation documentation.

Hint: how can I create a new branch?

You can create a new branch and change to it with git using the following command:

git checkout -b my_tutorial_branch

Solution

Open the website https://gitlab.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper/ in a browser. You may need to authenticate with your CERN login credentials. In the top right corner of the training-dataset-dumper project you see three buttons: a bell (notifications), a star (to favourite the project) next to a number, and a forking graph (to fork the project) with the text "Fork" next to a number. Click on the word "Fork" to open a new page, allowing you to specify the namespace of your fork. Click on "Select a namespace", choose your CERN username, and create the fork by clicking on "Fork project".

Next, you need to clone the project using git. Open a fresh terminal on your workstation, create a new folder and proceed with the installation as instructed in the quickstart / the documentation, with the only difference that we will use your fork as the origin project. To do so, open your forked project in a browser. The address typically is https://gitlab.cern.ch/<your CERN username>/training-dataset-dumper. When you click on the blue "Clone" button on the right-hand side of the page, a drop-down appears with the ssh path to the forked git project. Let's check out your personal fork.

mkdir tdd
cd tdd
git clone ssh://git@gitlab.cern.ch:7999/<your CERN username>/training-dataset-dumper.git
source training-dataset-dumper/setup/athanalysis.sh
mkdir build
cd build
cmake ../training-dataset-dumper
make
source x*/setup.sh
cd ..

As a result, you now have checked out and compiled your working copy of the training-dataset-dumper. Congratulations!

Now, set up a development branch for the tutorial.

cd training-dataset-dumper
git checkout -b my_tutorial_branch

For your convenience, it is a good idea to also add the main project as a second remote to the local copy obtained via git.

git remote add upstream ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git

From now on, you can get the latest version of the main project in the atlas-flavor-tagging-tools group using git fetch upstream and push your changes in a new branch to your personal fork. You can do so by pushing the branch my_tutorial_branch you created earlier to your fork using git push origin my_tutorial_branch.

2. Run a test job and inspect the output#

After successfully installing the training-dataset-dumper code, you will be in a position to run a test job using the package.

For this task, you will:

  1. Run a test job using the package test script.
  2. Inspect the output of the test job.
  3. Run another test job using a different configuration.
Hint: how can I run a test job?

The documentation includes instructions for running your first test job. Make sure to use the -h argument to gain an understanding of the different command line options.

Hint: where can I find the output of the test job?

Make sure to read the relevant section in the documentation carefully, and use the -h argument to see if you can find an argument for the test script that specifies the location of the test job output.

Hint: how can I inspect the job's output?

Again, you can find useful information in the documentation.

h5ls is a good tool for getting started, and is included with your installation of the training-dataset-dumper package. Try using the -h argument to find out how you can use the tool. You can use h5diff to compare the output of two different jobs.

Solution

If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:

source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
mkdir run
cd run

As specified in the documentation, you can use the test-dumper command to run a test job. This command takes one mandatory argument which specifies the input configuration for the test job. As mentioned in the docs, you can use pflow for your first test job, which will run a test job using the EMPFlow.json configuration file.

You need to use the -d optional argument to place the test job output in your working directory. So, after running

test-dumper -d testjob pflow
you should have a file named testjob/output.h5.

Next, run

h5ls -v testjob/output.h5
to list the contents of the job output. The -v argument produces more verbose output. Use -h to take a look at the other available arguments.

Finally, you should try running the test script with a mandatory argument other than pflow

test-dumper -d testjob_truth truth
and inspect the output again.

If you have time, you can try using h5py to open the output file using python, and inspect the contents. More detailed information about the use of h5py is covered in the bonus tasks.
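As a starting point, a short sketch (assuming the test job output at testjob/output.h5 from the command above) could be:

from h5py import File

# Sketch: list the datasets and the jet variables in the test job output.
with File("testjob/output.h5", "r") as h5file:
    print(list(h5file.keys()))             # top-level datasets, e.g. 'jets'
    jets = h5file["jets"]
    print(jets.shape, jets.dtype.names)    # number of jets and stored variables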

3. Modify the configuration files to disable the jet calibration#

You are already familiar with how to run a test job and select different configurations. Now we will modify one of the configuration files stored in configs/ and compare the output of running the dataset dumper with the modified file to the output obtained with the original file. The modification we are about to make deactivates the jet calibration, so that uncalibrated jet kinematic information is written to the output file. We will not use the test file downloaded by test-dumper, which only contains 10 events. Instead, for better visualisation we will run over the tutorial sample file which contains 1000 events.

For this task, you will:

  1. Run over the tutorial sample using the configs/EMPFlow.json config file. Save the output to output_with_jet_calibration.h5.
  2. Modify the configs/EMPFlow.json config file by deactivating the jet calibration.
  3. Run over the tutorial sample again, using the modified config file.
  4. Compare the difference in the output files with and without jet calibration.

If you feel brave, you can write a simple python plotting script (see bonus tasks) to make a plot comparing the calibrated and uncalibrated jet momenta.
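A possible sketch of such a comparison (assuming the calibrated output was saved as output_with_jet_calibration.h5, the run with the modified config used the default name output.h5, and the jet pt is stored in MeV) is:

from h5py import File
import matplotlib.pyplot as plt

# Sketch: overlay the calibrated and uncalibrated jet pt distributions.
with File("output_with_jet_calibration.h5", "r") as f_calib, File("output.h5", "r") as f_raw:
    pt_calib = f_calib["jets"]["pt"] / 1e3   # MeV -> GeV
    pt_raw = f_raw["jets"]["pt"] / 1e3

fig, ax = plt.subplots()
ax.hist(pt_calib, bins=50, range=(0, 250), histtype="step", label="calibrated")
ax.hist(pt_raw, bins=50, range=(0, 250), histtype="step", label="uncalibrated")
ax.set_xlabel("jet pt [GeV]")
ax.set_ylabel("Entries")
ax.legend()
fig.savefig("jet_pt_comparison.png")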

Hint: How do I run over the sample tutorial file?

You can process the tutorial sample using the dump-single-btag executable. Check the available options by running with the -h flag: one of the options is to name the output file. We assume that you have downloaded the tutorial sample as specified in the prerequisites to the path <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root.

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root --out-file output_with_jet_calibration.h5
Hint: Tired of typing long path names?

You can create a symlink from your run directory to the configuration files with

ln -s ../training-dataset-dumper/configs/ configs
From now on, you can type configs/EMPFlow.json in place of ../training-dataset-dumper/configs/EMPFlow.json.

You are free to do the same with DAOD_FTAG1.ttbar_tutorial.root, which will shorten many of these commands to something like:

dump-single-btag -c configs/EMPFlow.json DAOD_FTAG1.ttbar_tutorial.root

Hint: Where do I find the tutorial sample the task is referring to?

The prerequisites section of this page explains how to download the DxAOD file with 1000 ttbar events which is used in this tutorial. It is hosted at https://umami-ci-provider.web.cern.ch/tutorial/DAOD_FTAG1.ttbar_tutorial.root.

Assuming you run on lxplus, we suggest that you download it to eos, using wget. You can retrieve it with the following commands:

cd /eos/user/${USER:0:1}/${USER}/
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
wget https://umami-ci-provider.web.cern.ch/tutorial/DAOD_FTAG1.ttbar_tutorial.root
Hint: How can I deactivate the jet calibration?

In the config file configs/EMPFlow.json, replace

"calibration": {
    "file": "fragments/pflow-calibration.json"
},

with

"calibration": {},

to deactivate the jet calibration and change the output of the dataset dumper to use uncalibrated jet observables.

Hint: how can I compare the content of the two h5 files?

First let's make sure you changed something. You can run h5ls on the outputs with and without jet calibration applied. You should see that they have a different number of jets, because we're applying a selection on jet pT, |eta|, and JVT. If you delete the entries in the selection part of the calibration you should see the same number of jets in both cases.

While h5ls provides basic functionality to inspect the content of h5 files, more control is available through python scripts based on the h5py package. If you work on your private machine, you can simply install it with pip install h5py. If you work on an institute machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101, which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

You can access the content of an h5 file with h5py using a similar syntax as for a python dict.

A simple python script to print the jet pt values stored in an h5 output file named output.h5 could look like the following code snippet:

from h5py import File

with File("output.h5", 'r') as h5file:
    jets = h5file['jets']
    print(jets['pt'])
Solution

If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:

source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
cd training-dataset-dumper

Run the dataset dumper over the tutorial sample. Give the output file a non-default name so we don't overwrite it later!

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root --out-file output_with_jet_calibration.h5

Open the config file configs/EMPFlow.json in a text editor of your choice and replace

"calibration": {
    "file": "fragments/pflow-calibration.json"
},

with

"calibration": {},

to deactivate the jet calibration and save your changes.

Run the dataset dumper over the tutorial sample another time.

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root

We will write a simple script to output the calibrated and uncalibrated jet pt, using the python library h5py. On lxplus or other environments, you can provide it using LCG views. On your private machine, you can install it using pip.

Set up the LCG view 101 which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

Save the following content as a python script called print_jet_pt.py.

from h5py import File

input_file = "output_with_jet_calibration.h5"
input_file_raw = "output.h5"

with File(input_file, 'r') as h5file:
    jets = h5file['jets']
    print(jets['pt'])

with File(input_file_raw, 'r') as h5file_raw:
    jets_raw = h5file_raw['jets']
    print(jets_raw['pt'])

Execute the python script and compare the print-outs.

python3 print_jet_pt.py

4. Add/remove a jet variable in the configuration files#

Now that you are familiar with modifying the configuration files of the dataset dumper, we will modify the lists of jet and track variables scheduled to be written to the output file.

For this task, you will:

  1. Run the dataset dumper with the default config file to produce a reference output h5 file,
  2. Open the configuration file and remove the jet kinematic information, as well as the jet flavour label from the scheduled list of output variables.
  3. Run the dataset dumper with the modified config file to produce a second output h5 file.
  4. Inspect both h5 files and compare their content.
Hint: where do I find the config file where the output variables are defined?

You can find the configuration files in the directory configs. Information on their structure is provided in the documentation.

Have a look both at the config file you are using and the fragments it includes, which reside in the configs/fragments directory.

Hint: how can I find the corresponding names in the EDM of the jet kinematic information and jet flavour labels?

A comprehensive overview of all variables currently being dumped from a reference file is provided in the documentation, together with some explanation of their meaning. The kinematic properties of jets are typically encoded as four-vectors, p = (energy, pt, eta, phi). Because of the cylindrical symmetry of the ATLAS detector, we neglect the component phi in the training. To find the variable which labels the jet flavour, use the browser's search function for the words "jet label" on that page.

Solution

If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:

source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
cd training-dataset-dumper

Run the dataset dumper over the tutorial sample.

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root

Inspect the content of the output file.

h5ls -v output.h5 | tee content_before_modification.txt

The variables which you need to remove from the config file are:

  • "pt"
  • "eta"
  • "energy"
  • "HadronConeExclTruthLabelID"

Open the config file configs/EMPFlow.json in your favourite text editor and inspect it. The desired variables are not listed directly but appear in one of the included config fragments. You therefore also need to open configs/fragments/pflow-variables.json. In this file, you can find the variables listed above. When you remove them, pay close attention to the commas at the end of the lines so that the file remains valid json. Learn about the json structure here. Save the modified file now.
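If you are unsure whether your edit kept the fragment valid, a quick sketch using Python's standard json module (assuming the fragment contains plain JSON without comments) is:

import json

# Sketch: parsing raises a JSONDecodeError if the edit broke the file.
with open("configs/fragments/pflow-variables.json") as config_file:
    json.load(config_file)
print("valid JSON")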

With the modified file, run the dumper again over the tutorial sample.

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root

Inspect the content of the new output file (note that the previous output.h5 file got overwritten).

h5ls -v output.h5 | tee content_after_modification.txt

Compare the two text dumps of the output file content.

diff content_before_modification.txt content_after_modification.txt

You should see that the variables which you removed from the config file are missing from the new output file.
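Alternatively, a small h5py sketch (assuming the new output.h5 produced with the modified config) can confirm that the removed variables are really gone:

from h5py import File

# Sketch: check that none of the removed variables remain in the jets dataset.
removed = {"pt", "eta", "energy", "HadronConeExclTruthLabelID"}

with File("output.h5", "r") as h5file:
    fields = set(h5file["jets"].dtype.names)
    print("still present:", removed & fields)   # should print an empty set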

5. Schedule evaluation of a neural network when running the training-dataset-dumper#

The training-dataset-dumper can not only dump the content of a DxAOD file and convert it to an h5 file; it can also schedule neural networks provided in the lwtnn format via the DL2 interface. Similarly, neural networks provided in the onnx format can be scheduled via the same interface.

Although it may sound frightening at first, scheduling a neural network is in fact very easy and can be achieved within the AnalysisBase release, using the existing setup of the dataset-dumper which you already used during the previous exercises. The only action required from your side is another modification of a config file.

For this task, you will:

  1. Read the documentation on how to add a tagger in onnx format.
  2. Identify the path to the GN2v00 tagger trained on r22 p5169 with 192M jets using the overview page of available taggers.
  3. Add the GN2v00 tagger to the "dl2_configs" entry in the config file configs/EMPFlow.json. Make sure to rename the outputs using the "remapping" keyword as follows:
        "GN2v00_pu": "GN2v00_tutorial_pu",
        "GN2v00_pc": "GN2v00_tutorial_pc",
        "GN2v00_pb": "GN2v00_tutorial_pb"
    
  4. Run the dataset dumper with the modified config file to produce a second output h5 file.
  5. Inspect the output h5 file and look at its content. Does it contain the entries GN2v00_tutorial_pu, GN2v00_tutorial_pc, GN2v00_tutorial_pb?
Hint: How can I schedule taggers in the config file?

The documentation includes instructions for filling out the dl2_configs entry in the config file. Alternatively, you can take a look at other config files which already make use of scheduling neural networks as part of the dataset-dumper.

Solution

Open the config file configs/EMPFlow.json in your favourite text editor and inspect it. Replace the entry

"dl2_configs": [
    {
        "nn_file_path": "BTagging/20231205/GN2v01/antikt4empflow/network_fold0.onnx",
        "engine": "gnn",
        "decorate_tracks": true
    }
],

with

"dl2_configs": [
    {
        "nn_file_path": "BTagging/20231205/GN2v01/antikt4empflow/network_fold0.onnx",
        "engine": "gnn",
        "decorate_tracks": true
    },
    {
        "nn_file_path": "BTagging/20230306/gn2v00/antikt4empflow/network.onnx",
        "engine": "gnn",
        "remapping": {
            "GN2v00_pu": "GN2v00_tutorial_pu",
            "GN2v00_pc": "GN2v00_tutorial_pc",
            "GN2v00_pb": "GN2v00_tutorial_pb"
        }
    }
],

and save your changes. The jets are now evaluated with this additional neural network. To also store its outputs in the h5 file, you need to add them to the floats:

"variables": {
    "file": "fragments/single-btag-variables.json",
    "btagging": {
        "floats": [
            "GN2v00_pb",
            "GN2v00_pc",
            "GN2v00_pu",
            "GN2v00_tutorial_pu",
            "GN2v00_tutorial_pc",
            "GN2v00_tutorial_pb"
        ]
    }
}

With the modified file, run the dumper over the tutorial sample.

dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root

Inspect the content of the output file and look for the output variables GN2v00_tutorial_pu, GN2v00_tutorial_pc, GN2v00_tutorial_pb.

h5ls -v output.h5 | grep GN2v00_tutorial
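Equivalently, a small h5py sketch (assuming the output file is output.h5 in your run directory) can verify that the remapped scores were written:

from h5py import File

# Sketch: check that the remapped GN2v00 scores exist in the jets dataset.
with File("output.h5", "r") as h5file:
    fields = h5file["jets"].dtype.names
    for name in ("GN2v00_tutorial_pu", "GN2v00_tutorial_pc", "GN2v00_tutorial_pb"):
        print(name, "found" if name in fields else "MISSING")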

6. Open a merge request#

The training-dataset-dumper is actively developed. Please have a look at the development guidelines. Discussion about the latest developments takes place via GitLab issues in the main project and in the mattermost channel (click here to sign up for the ATLAS FTAG mattermost team). Furthermore, there is the FTAG discourse platform, which uses threads to organise discussion topics.

The goal of this task is to make you familiar with the GitLab workflow of opening merge requests to add code changes. A prerequisite for this task is that you have created a fork of the project in the first task of the tutorial.

For this task, you will:

  1. Identify an issue to address with a merge request in the list of open issues.
  2. Create a new branch, modify the code locally, commit your changes with a descriptive commit message, and push the new branch to your fork.
  3. Create a merge request of the fork to the main project using the GitLab web interface.
Hint: Where can I find more information on how to use git and GitLab?

Several resources on using git for version control exist. Please refer to this collection of useful resources.

Solution

We assume that you have created a fork of the project in the first task of the tutorial and will use the same names for your fork and the main project as in the solution to the first task. That is, the fork is origin and the main project is upstream.

Choose an issue from the list of open issues to address. We will assume for the solution that you want to improve the documentation.

Create a new branch. For the sake of this solution we assume that we call it improved_documentation. You can of course choose a different name which provides a brief description of your planned modification.

git checkout -b improved_documentation

Pull the latest changes from the default branch of the main project (the branch is called main).

git pull upstream main

With your favourite text editor, carry out the planned modifications and commit your changes with a descriptive commit message.

git commit -m "improve documentation"

Push your changes to your personal fork.

git push origin improved_documentation

In the text appearing in your console, you will see a link to a GitLab webpage. Follow that link to directly open a merge request. Then, fill out the form and submit the merge request.

7. Run the training-dataset-dumper on the grid#

While the previous tasks target local development and testing configurations, the typical use-case of the training-dataset-dumper is to process large files on the LHC computing grid. For running the training-dataset-dumper on the grid, we first need a grid certificate. Please refer to this page for further information. We assume that you have the grid certificate downloaded and ready to use.

The dataset we want to process with the training-dataset-dumper is defined in the sample list for grid jobs BTagTrainingPreprocessing/grid/inputs/single-btag.txt.

We will just submit a job for the first sample in that list, which starts with mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_FTAG1.

For this task, you will:

  1. Set up the dataset dumper for submitting to the grid.
  2. Find the input file and comment out all but the above entry.
  3. Schedule an available tagger by adding it to the config file used by the grid-submit single-btag submission script, see configs/EMPFlow.json.
  4. Commit these changes to your development branch.
  5. Dry run the submission process to test if everything works.
  6. Submit the dataset to the grid and tag it with Test.
  7. Check whether you have successfully submitted the job using the BigPanDA website and monitor its progress.
  8. After the job has finished, retrieve the output h5 file from the grid using rucio.

When submitting to the grid, the dataset dumper will automatically take a snapshot of your current setup by creating a git commit and pushing a tag to the repository. Make sure that you have no uncommitted changes to your files before running the grid submission script.

Hint: Where do I find information about running dataset dumper on the grid?

If you have trouble finding the correct setup file, have a look here: Grid Dumps documentation.

Solution: Set up the dataset dumper for submitting to the grid.

First you need to set up the required software (similar to the local usage).

source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh

Now, we need to prepare the grid setup:

source training-dataset-dumper/BTagTrainingPreprocessing/grid/setup.sh

If you are working on lxplus, this should work instantly. If you are working on a cluster with cvmfs access, you might need to run lsetup emi before sourcing the grid setup.

Hint: In which directory can I find the correct submission script and the text file with the sample to be submitted?

Check the training-dataset-dumper/BTagTrainingPreprocessing/grid folder. Look inside the file inputs/single-btag.txt and search for the input datasets.

Hint: In which config file can I schedule an additional tagger and where?

Look in the directory with config files and search for the EMPFlow.json file. Open it with your favourite text editor and look out for the dl2_configs entry.

Solution: Schedule an available tagger by adding it to the config file used by the grid-submit single-btag submission script.

You need to add the tagger as a DL2 entry. A nice explanation of how this is done is given here.

Hint: Where do I find information on dry running the grid submission?

Try to run

grid-submit -h

and check the options for the dry run and the tag option.

Solution: Dry run the submission process to test if everything works.

The dry run (without actual submission) can be started with

grid-submit -d -t Test single-btag

This will dry run the submission process without actually submitting the datasets defined in INPUT_DATASETS.

Hint: Where do I find information on how to tag the output when submitting to the grid?

Try to run

grid-submit -h

and check the options for the tag option.

Hint: How can I just dump a small part of a sample?

Ensure rucio is set up

lsetup rucio

Then select the number of events to dump via

grid-submit -c {config} -i {inputs} -n 10000 single-btag

This will select the number of files in each input container required to get at least 10,000 events. If each file has 20,000 events, then asking for 10,000 will run a single file and produce 20,000 events.

Hint: I get an error message that git remote get-url is not available. What can I do about it?

The functionality git remote get-url only becomes available in git versions newer than 1.8.3.1. If you work on lxplus, you can get a recent version of git with lsetup.

setupATLAS
lsetup git
Solution: Submit the dataset to the grid and tag it with Test.

To actually submit the dataset:

grid-submit -t Test single-btag

This will submit the samples and also push a tag of the current local version of the dataset dumper to your fork.

Solution: Check whether you have successfully submitted the job using the BigPanDA website and monitor its progress.

You can find your submitted job on the BigPanDA webpage. Click at the top on My BigPanDA and scroll down. You should be able to see your job there. If not, wait for a few seconds and click at the top right on Refresh.

Solution: After the job has finished, retrieve the output h5 file from the grid using rucio.

To retrieve the finalised file from the grid, we first need to set up rucio. This can be done by running:

setupATLAS
localSetupRucioClients

If you are not running on lxplus, you need a cluster with cvmfs access. In some cases you need to run lsetup emi and voms-proxy-init -voms atlas before running localSetupRucioClients.

Now you need the name of the container/dataset. This can be retrieved from the BigPanDA job page. Scroll down on your BigPanDA page and click on the task name. Now scroll down a bit and you will find the Containers section with the input and the output. Normally two outputs are provided: the log files of the job (they end in .log) and the real output files (they end in _output.h5). The container/dataset name that we are searching for is the full name ending in _output.h5.

After that, you can simply download your file from the grid with rucio -v download <dataset-name>. Please keep in mind that the files produced by the grid are stored on scratch disks and are erased after two weeks. To keep them safe, you need to copy them to a local disk using the Rucio web interface. Create a rule there, add your dataset/container ID, and select a local disk where you have write access. After submitting the rule, your dataset is copied automatically to that disk.

Bonus tasks#

Download a DxAOD file using rucio#

The typical use-case of the training-dataset-dumper is to process the large DxAOD dataset containers on the grid. However, for local testing it is often useful to download single DxAOD files and process them locally for development. Retrieval of files stored on the computing grid infrastructure is possible using the rucio tool.

You can find documentation on its usage in the ATLAS software tutorial.

You can find information on recommended DxAOD MC samples in the FTAG algorithm docs MC sample page.

For this task, you will:

  1. Identify the dataset container name of a PHYSVAL DxAOD ttbar sample for the mc20d MC campaign.
  2. Download a single file from that container.
Solution

Open the website https://ftag-docs.docs.cern.ch/samples/samples/ and search for the ttbar sample entry for the MC campaign mc20d. Scrolling to the right, you should see an entry in the DAOD column, e.g. mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_FTAG1.e6337_s3681_r13144_p5627 (might be a different file at the time you are looking at this tutorial: in this case use the name listed on the webpage).

Note that you can only do the following task (within the scope of this tutorial) on a machine with /cvmfs access, e.g. on CERN's lxplus machines.

In a shell, set up the ATLAS software, set up rucio and initialise your grid proxy.

setupATLAS
lsetup rucio
voms-proxy-init -voms atlas

Next, download one random file from the dataset container

rucio download --nrandom 1 mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_FTAG1.e6337_s3681_r13144_p5627

Inspect the content of a DxAOD file#

Having a list of all possible input variables in a DxAOD file is often very useful.

For this task, you will dump a list of all objects and variables stored in the DxAOD to a text file. To achieve this, you can make use of the checkFile.py script which is provided in the AthAnalysis and Athena releases.

Solution

Set up the dumper using an Athena and not an AnalysisBase release.

source training-dataset-dumper/setup/athena.sh

Run checkFile.py on the DxAOD file whose content you want to dump. We assume you are running on the tutorial sample file stored in <your_path_to>/ftag_tutorial/data/.

checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root

The -d command line option prompts the script to provide a detailed dump.

It can be useful to store the dump in a text file.

checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root | tee dxaod_content.txt

It can also be useful to employ the command line tool grep to filter out certain information. Below is an example to only show variables including SV1.

checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root | grep SV1

Write a plotting script to display variables in h5 file#

With the output h5 files of the training-dataset-dumper, it is easy to create plots of variables in the python ecosystem, using h5py to process the h5 files and the packages numpy and matplotlib to create the plots.

You need to install the packages h5py, numpy, and matplotlib for this task. On your private machine you can do this using pip. On lxplus or your institute's machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101, which supports python3 and h5py.

Your task is to plot a histogram of the pt distribution of jets in the output.h5 file created with the training-dataset dumper when processing the tutorial MC sample.

Solution

We assume you are working on lxplus and are using LCG views to provide the required python packages. Set up the LCG view 101 which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

We assume that you still have an output file from running the dumper in a previous task, which is called output.h5 and is located in the training-dataset-dumper directory.

Save the following content as a python script called plot_jet_pt.py.

from h5py import File
import matplotlib.pyplot as plt

input_file = "output.h5"

with File(input_file, 'r') as h5file:
    jets = h5file['jets']
    fig, ax = plt.subplots()
    ax.hist(jets['pt'])
    ax.set_xlabel('jet pt [MeV]')
    ax.set_ylabel('Entries')
    fig.savefig('jet_pt.png')

Execute the python script to create the plot.

python3 plot_jet_pt.py

The resulting plot is stored in jet_pt.png. You can take a look at it using display.

display jet_pt.png

In case you want to improve the layout and style of the plot, consider having a look at the package mplhep which provides the ATLAS plot style.
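For example, a minimal sketch (assuming mplhep is installed, e.g. via pip) applying the ATLAS style before creating the figure looks like this:

import matplotlib.pyplot as plt
import mplhep

# Sketch: all figures created after this call use the ATLAS plot style.
plt.style.use(mplhep.style.ATLAS)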

Write a plotting script to draw a ROC curve#

As in the previous task, it is not difficult to use the information in the output h5 files of the training-dataset-dumper to evaluate the performance of taggers directly with a simple python plotting script. A prerequisite is that the taggers have been scheduled either as part of the derivation which produced the DxAOD input file, or while dumping the h5 file (see task 5 in this tutorial).

You need to install the packages h5py, numpy, and matplotlib for this task. On your private machine you can do this using pip. On lxplus or your institute's machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101, which supports python3 and h5py.

Your task is to plot a ROC curve showing the b-tagging efficiency versus the light-jet rejection for a tagger whose scores are provided in the h5 file as output variables (the solution below uses the GN2v00 scores scheduled in task 5).
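The solution below builds the b-tagging discriminant from the tagger output probabilities p_b, p_c and p_u as

D_b = \ln\left(\frac{p_b}{f_c\, p_c + (1 - f_c)\, p_u}\right)

with a c-jet fraction of f_c = 0.018 (the value used in the script), and scans a cut on D_b to obtain the efficiency and rejection values for the ROC curve.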

Solution

We assume you are working on lxplus and are using LCG views to provide the required python packages. Set up the LCG view 101 which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

We assume that you still have an output file from running the dumper in a previous task, which is called output.h5 and is located in the training-dataset-dumper directory. We will plot the ROC curve of the GN2v00 tagger.

Save the following content as a python script called plot_ROC.py.

from h5py import File
import numpy as np
import matplotlib.pyplot as plt

input_file = "output.h5"
tagger = 'GN2v00'
flavours = {'b': 5, 'c': 4, 'u': 0}

with File(input_file, 'r') as h5file:
    jets = h5file['jets']
    select_b = (jets['HadronConeExclTruthLabelID'] == flavours['b'])
    select_u = (jets['HadronConeExclTruthLabelID'] == flavours['u'])

    flav_b = {f:jets[select_b][f'{tagger}_p{f}'] for f in 'cub'}
    flav_u = {f:jets[select_u][f'{tagger}_p{f}'] for f in 'cub'}

    # compute discriminants
    fc = 0.018
    discrim_b = np.log(flav_b['b'] / (fc * flav_b['c'] + (1-fc) * flav_b['u']))
    discrim_u = np.log(flav_u['b'] / (fc * flav_u['c'] + (1-fc) * flav_u['u']))

    # turn into histogram
    infar = np.array([np.inf])
    edges = np.concatenate([-infar,np.linspace(-20,20,1000),infar])
    h_b = np.histogram(discrim_b, edges)[0]
    h_u = np.histogram(discrim_u, edges)[0]

    # make plot
    fig, ax = plt.subplots()
    beff = h_b[::-1].cumsum() / h_b.sum()
    ueff = h_u[::-1].cumsum() / h_u.sum()
    valid = (beff > 0.5) & (ueff > 0)
    rej = 1/ueff[valid]
    eff = beff[valid]
    ax.plot(eff, rej)

    ax.set_yscale('log')
    ax.set_xlabel(r'$b$ Efficiency')
    ax.set_ylabel(r'Light jet rejection')
    fig.savefig('jet_ROC_DL1r.png')

Execute the python script and create the ROC curve.

python3 plot_ROC.py

The resulting plot is stored in jet_ROC_DL1r.png. You can take a look at it using display.

display jet_ROC_DL1r.png

In case you want to improve the layout and style of the plot, consider having a look at the package mplhep which provides the ATLAS plot style.

Manipulate h5 files#

You can manipulate h5 files using python scripts which employ h5py.

Have a look at these git projects and the scripts which they host:

  1. hdf5_manipulator (GitHub)
  2. Manuel's improved version of hdf5_manipulator (GitLab)
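As a minimal self-contained example, the following sketch (assuming a dumper output file output.h5 with a jets dataset containing a pt field) writes the pt of the first 100 jets to a new, much smaller h5 file:

from h5py import File

# Sketch: extract a single variable for the first 100 jets into a new file.
with File("output.h5", "r") as h5in, File("output_small.h5", "w") as h5out:
    jets = h5in["jets"][:100]                       # structured array of the first 100 jets
    h5out.create_dataset("jet_pt", data=jets["pt"])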