Training Dataset Dumper Tutorial#
Introduction#
In this tutorial you will learn to use the Training-Dataset-Dumper (TDD), an essential tool to extract information for the training and evaluation of flavour tagging algorithms from Data/MC files in the ATLAS event data model.
The motivation for using the TDD is to decouple the ATLAS analysis software from the algorithm development, which mainly relies on modern python tools that may behave awkwardly in the ATLAS and CERN ROOT environment.
The output of the dataset dumper consists of h5 files which store jet-related and track-related observables in arrays that can be processed using tools such as numpy, used as input for neural networks defined with TensorFlow or PyTorch, and visualised using matplotlib. The main usage of the dumped h5 files (also called ntuples) is to provide input to the training pipeline for the training and evaluation of flavour tagging algorithms.
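As a first impression, the following minimal sketch shows how such a dumped file can be read back with h5py and numpy (it assumes a file called output.h5 containing a "jets" dataset, as produced later in this tutorial):
from h5py import File
import numpy as np

# open a dumped file and read the "jets" dataset into a structured numpy array
with File("output.h5", "r") as h5file:
    jets = h5file["jets"][:]

print(jets.dtype.names)     # names of the dumped jet variables
print(np.mean(jets["pt"]))  # e.g. the mean jet transverse momentum in MeV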
In this tutorial you will learn how to:
- Fork, clone, and install the TDD.
- Run a test job and inspect the output.
- Modify the configuration files to disable the jet calibration and confirm expected change in output.
- Add/remove a jet variable in the configuration files and confirm expected change in output.
- Schedule a neural network to be evaluated while running the TDD and write the network scores to the output file.
- Open a merge request to the TDD GitLab project to fix an issue or improve the documentation.
- Run the TDD on the grid and retrieve the output file.
If you manage to do all these tasks, there are a few bonus exercises, prompting you to learn how to:
- Download a DxAOD file using rucio.
- Write a plotting script to display some variables stored in the h5 file using either ROOT or matplotlib.
- Change the track selection in the configuration files of the TDD, run with the modified selection and inspect the output.
- Manipulate h5 files, e.g. extract only one variable or only a few events from an h5 file.
The tutorial is meant to be followed in a self-guided manner. You will be prompted to do certain tasks by being told what the desired outcome is, without being told how to achieve it. Using the documentation of the TDD, you can find out how to reach your goal. In case you are stuck, you can click on the "hint" toggle box to get a hint. If you have tried a problem for more than 10 minutes, feel free to also toggle the solution with a worked example.
In case you encounter some errors, please reach out on the TDD mattermost channel (click here to sign up) and open a merge request to fix the tutorial.
Prerequisites#
You need access to a shell on either CERN's lxplus or your local institute's machine with access to /cvmfs, so that you can set up the ATLAS software environment.
Alternatively, you can also run inside a container which provides the ATLAS software environment on your local computer. Below, instructions are provided for both cases. Please choose the appropriate one.
Prepare the environment on lxplus
For following the tutorial session on lxplus, we recommend connecting via ssh lxplus.cern.ch.
We recommend using your EOS space for this tutorial. There you have 1 TB of space available.
You can switch there using:
cd /eos/user/${USER:0:1}/${USER}/
If you wish to work in your home directory, make sure that sufficient disk space is available.
The installation of the dumper needs at least 200 MB of free disk space.
You can check the used quota and available disk space with fs quota.
Before you start with the tutorial, make sure that you are using a recent version of git.
On lxplus, you can do that by setting up a version with lsetup.
setupATLAS
lsetup git
In addition, you need to download a sample file which will be processed by the TDD. You can retrieve it with the following commands:
cd /eos/user/${USER:0:1}/${USER}/
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
curl -s https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/dumper-test-files/-/raw/main/p6859/DAOD_FTAG1.601589.e8549_s4162_r14622_p6859.small.pool.root > DAOD_FTAG1.ttbar_tutorial.root
Prepare environment on local machine without cvmfs access (e.g. your laptop)
In case you want to work on a machine without access to cvmfs (the CERN Virtual Machine File System which distributes the ATLAS software), you can still follow the tutorial using a Docker container.
If you haven't done so already, install Docker Desktop and follow the installation instructions below. Note that these differ a little from the setup shown in the solution to the first task.
# authenticate by logging in to the CERN GitLab container registry with your CERN username and password
docker login gitlab-registry.cern.ch
# download TDD image (You can find the latest stable version in use for the TDD in the .gitlab-ci.yml)
docker pull gitlab-registry.cern.ch/atlas/athena/athanalysis:25.2.59
# check out TDD project
mkdir tdd && cd tdd
git clone ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git
In addition, you need to download a sample file which will be processed by the TDD. You can retrieve it with the following commands:
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
curl -s https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/dumper-test-files/-/raw/main/p6859/DAOD_FTAG1.601589.e8549_s4162_r14622_p6859.small.pool.root > DAOD_FTAG1.ttbar_tutorial.root
cd -
Now you are ready to launch the docker container and compile the dataset dumper.
# launch docker container
# start docker container and mount current directory inside container
docker run --rm -it -v $PWD:/home/workdir --workdir /home/workdir gitlab-registry.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper:main
# compile code: no need to source a setup script with "asetup" inside of a docker container
mkdir build
cd build
cmake ../training-dataset-dumper
make
# add executables to system path
source x*/setup.sh
cd ..
Tutorial Tasks#
1. Fork, Clone, and Install the TDD#
Before you can start with the other tasks, you need to do this one first. The expected outcome of this task is that you will have:
- Created a personal fork of the Training-Dataset-Dumper GitLab project,
- Cloned it to your work area on your machine using git,
- Set up a development branch for the tutorial called my_tutorial_branch,
- Successfully compiled it and set up the paths to be able to use it.
Go to the GitLab project page of the TDD to begin with the task: https://gitlab.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper/
Hint: How can I create a fork of a project?
In case you are stuck how to create your personal fork of the project, you can find some general information on git and the forking concept here in the GitLab documentation.
Hint: How can I clone and compile the project?
In case you are stuck and don't know what to do to retrieve the project code using git clone
and how to compile it, have a look at the installation documentation.
Hint: How can I create a new branch?
You can create a new branch and change to it with git
using the following command:
git checkout -b my_tutorial_branch
Solution
Open the website https://gitlab.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper/ in a browser. You may need to authenticate with your CERN login credentials. In the top right corner of the TDD project you see three buttons: a bell (notifications), a star (to favourite the project) next to a number, and a forking graph (to fork the project) with the text "Fork" next to a number. Click on the word "Fork" to open a new page, allowing you to specify the namespace of your fork. Click on "Select a namespace", choose your CERN username, and create the fork by clicking on "Fork project".
Next, you need to clone the project using git. Open a fresh terminal on your workstation, create a new folder and proceed with the installation as instructed in the quickstart / the documentation, with the only difference that we will use your fork as the origin
project. To do so, open your forked project in a browser. The address typically is https://gitlab.cern.ch/<your CERN username>/training-dataset-dumper
. When clicking on the blue "Clone" button at the right hand-side of the page, a drop-down mini-page appears with the ssh path to the forked git project. Let's check out your personal fork.
To do so, switch back to a place where you want to have the dumper code. Now do the following:
mkdir tdd
cd tdd
git clone ssh://git@gitlab.cern.ch:7999/<your CERN username>/training-dataset-dumper.git
As a result, you now have checked out your working copy of the TDD. Congratulations!
Now, set up a development branch for the tutorial.
cd training-dataset-dumper
git checkout -b my_tutorial_branch
For your convenience, it is a good idea to also attach the main project to the local copy obtained via git.
git remote add upstream ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git
From now on, you can get the latest version of the main project in the atlas-flavor-tagging-tools group using git fetch upstream and push your changes in a new branch to your personal fork.
You can do so by pushing the branch my_tutorial_branch you created earlier to your fork using git push origin my_tutorial_branch.
Now that you have a local version, we need to compile the dumper. To do so, run the following code in the tdd folder where you also cloned your fork:
# Setup the AthAnalysis version (this is the code base on which the dumper works)
source training-dataset-dumper/setup/athanalysis.sh
# Create a build directory for the dumper and go there
mkdir build
cd build
# Now cmake the dumper make scripts
cmake ../training-dataset-dumper
# Now, actually "make" the dumper
make -j 8
# After the compilation is done, we need to source the setup of the newly built dumper
source x*/setup.sh
# Now switch back to your directory
cd ..
2. Run a Test Job and Inspect the Output#
After successfully installing the TDD code, you will be in a position to run a test job using the package.
For this task, you will:
- Run a test job using the package test script.
- Inspect the output of the test job.
- Run another test job using a different configuration.
Hint: How can I run a test job?
The documentation includes instructions for running your first test job. Make sure to use the -h
argument to gain an understanding of the different command line options.
Hint: I can't find the output of the test job?
Make sure to read the relevant section in the documentation carefully, and use the -h argument to see if you can find an argument for the test script that specifies the output location for the test job output.
Hint: How can I inspect the job's output?
Again, you can find useful information in the documentation.
h5ls
is a good tool for getting started, and is included with your installation of the TDD package. Try using the -h
argument to find out how you can use the tool. You can use h5diff
to compare the output of two different jobs.
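If you prefer python, a similar comparison can be sketched with h5py, for example to see which jet variables are stored in two different outputs (a sketch, assuming both output files from the test jobs below exist and contain a jets dataset):
from h5py import File

# compare the jet variable names stored in two dumper outputs
with File("testjob/output.h5", "r") as f1, File("testjob_truth/output.h5", "r") as f2:
    vars1 = set(f1["jets"].dtype.names)
    vars2 = set(f2["jets"].dtype.names)

print("only in the first file:", sorted(vars1 - vars2))
print("only in the second file:", sorted(vars2 - vars1))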
Solution
If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:
source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
Now create a run directory. This is not strictly needed, but keeps all log files and outputs in one folder.
mkdir run
cd run
As specified in the documentation, you can use the test-dumper
command to run a test job. This command takes one mandatory argument which specifies the input configuration for the test job. As mentioned in the docs, you can use pflow
for your first test job, which will run a test job using the EMPFlow.json
configuration file.
You need to use the -d optional argument to place the test job output in your working directory. So, after running
test-dumper -d testjob pflow
the output file can be found at testjob/output.h5.
Next, run
h5ls -v testjob/output.h5
to list the contents of the job output. The -v
argument produces more verbose output. Use -h
to take a look at the other available arguments.
Finally, you should try running the test script with a different mandatory argument than pflow:
test-dumper -d testjob_truth truth
If you have time, you can try using h5py
to open the output file using python, and inspect the contents. More detailed information about the use of h5py
is covered in the bonus tasks.
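As a quick preview, a first look at the test job output with h5py could be as simple as the following sketch (assuming the test job output from above is located at testjob/output.h5):
from h5py import File

# list the datasets in the test job output and look at the jet variables
with File("testjob/output.h5", "r") as h5file:
    print(list(h5file.keys()))
    jets = h5file["jets"]
    print(jets.shape, jets.dtype.names)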
3. Modify the Configuration Files to Disable the Jet Calibration#
You are already familiar with how to run a test job and select different configurations.
Now we will touch one of the configuration files which are stored in configs/
and compare the output of running the dataset dumper with the modified file to the output when using the original file.
The modification we are about to make will deactivate the jet calibration when dumping the outputs to the file, resulting in using uncalibrated jet kinematic information when creating the output file.
For this task, you will:
- Run over the tutorial sample using the configs/EMPFlow.json config file. Save the output to output_with_jet_calibration.h5.
- Modify the configs/EMPFlow.json config file by deactivating the jet calibration.
- Run over the tutorial sample again, using the modified config file.
- Compare the difference in the output files with and without jet calibration.
If you feel brave, you can write a simple python plotting script (see bonus tasks) to make a plot comparing the calibrated and uncalibrated jet momenta. For this, you should use a different input file, because we are working with 10-event test files right now.
Hint: How do I run over the sample tutorial file?
You can process the tutorial sample using the dump-single-btag
executable. Check the available options by running with the -h
flag: one of the options is to name the output file. We assume that you have downloaded the tutorial sample as specified in the prerequisites to the path <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
.
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root --output output_with_jet_calibration.h5
Hint: Tired of typing long path names?
You can create a symlink from your run
directory to the configuration files with
ln -s ../training-dataset-dumper/configs/ configs
From now on, you can type configs/EMPFlow.json
in place of ../training-dataset-dumper/configs/EMPFlow.json
.
You are free to do the same with DAOD_FTAG1.ttbar_tutorial.root
, which will shorten many of these commands to something like:
dump-single-btag -c configs/EMPFlow.json DAOD_FTAG1.ttbar_tutorial.root
Hint: Where do I find the tutorial sample the task is referring to?
The prerequisites section of this page explains how to download the DxAOD file with 10 ttbar events which is used in this tutorial. It is hosted in the dumper-test-files repository.
Assuming you run on lxplus, we suggest that you download it to eos using curl.
You can retrieve it with the following commands:
cd /eos/user/${USER:0:1}/${USER}/
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
curl -s https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/dumper-test-files/-/raw/main/p6859/DAOD_FTAG1.601589.e8549_s4162_r14622_p6859.small.pool.root > DAOD_FTAG1.ttbar_tutorial.root
Hint: How can I deactivate the jet calibration?
In the config file configs/EMPFlow.json
, replace
"calibration": {
"file": "fragments/pflow-calibration.json"
},
with
"calibration": {},
to deactivate the jet calibration and change the output of the dataset dumper to use uncalibrated jet observables.
Hint: How can I compare the content of the two h5 files?
First let's make sure you changed something. You can run h5ls
on the outputs with and without jet calibration applied. You should see that they have a different number of jets, because we're applying a selection on p_\mathrm{T}, |\eta|, and JVT. If you delete the entries in the selection
part of the calibration you should see the same number of jets in both cases.
While the h5ls
script provides basic functionality to inspect the content of h5
files, more control is provided using python
scripts using the h5py
package.
If you work on your private machine, you can simply install it with pip install h5py
. If you work on an institute machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101
which supports python3
and h5py
.
source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh
You can access the content of an h5 file with h5py using a syntax similar to that of a python dict.
A simple python script to print the jet pt values stored in an h5 output file with the name output.h5 could look like the following code snippet:
from h5py import File
with File("output.h5", 'r') as h5file:
jets = h5file['jets']
print(jets['pt'])
Solution
If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:
source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
cd run
Run the dataset dumper over the tutorial sample. Give the output file a non-default name so we don't overwrite it later!
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root --output output_with_jet_calibration.h5
Open the config file configs/EMPFlow.json in a text editor of your choice and replace
"calibration": {
"file": "fragments/pflow-calibration.json"
},
with
"calibration": {},
to deactivate the jet calibration and save your changes.
Run the dataset dumper over the tutorial sample another time.
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
We will write a simple script to output the calibrated and uncalibrated jet pt, using the python library h5py.
On lxplus or other environments, you can provide it using LCG views. On your private machine, you can install it using pip.
Set up the LCG view 101 which supports python3 and h5py.
source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh
Save the following content as a python script called print_jet_pt.py
.
from h5py import File
input_file = "output_with_jet_calibration.h5"
input_file_raw = "output.h5"
with File(input_file, 'r') as h5file:
    jets = h5file['jets']
    print(jets['pt'])
with File(input_file_raw, 'r') as h5file_raw:
    jets_raw = h5file_raw['jets']
    print(jets_raw['pt'])
Execute the python script and compare the print-outs.
python3 print_jet_pt.py
4. Add/Remove a Jet Variable in the Configuration Files#
Now that you are familiar with modifying the configuration files of the dataset dumper, we will modify the lists of jet and track variables scheduled to be written to the output file.
For this task, you will:
- Run the dataset dumper with the default config file to produce a reference output h5 file.
- Open the configuration file and remove the jet kinematic information, as well as the jet flavour label, from the scheduled list of output variables.
- Run the dataset dumper with the modified config file to produce a second output h5 file.
- Inspect both h5 files and compare their content.
Hint: where do I find the config file where the output variables are defined?
You can find the configuration files in the directory configs
.
Information on their structure is provided in the documentation.
Have a look both at the config file you are using and the fragments it includes, which reside in the configs/fragments
directory.
Hint: how can I find the corresponding names in the EDM of the jet kinematic information and jet flavour labels?
A comprehensive overview of all variables currently being dumped from a reference file is provided in the documentation, together with some explanation of their meaning. The kinematic properties of jets are typically encoded as four-vectors, p = (energy, pt, eta, phi). Because of the cylindrical symmetry of the ATLAS detector, we neglect the component phi in the training. For finding the variable which labels the jet flavour, use the browser's search function for "jet label" on that page.
Solution
If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:
source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
cd run
Run the dataset dumper over the tutorial sample.
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
Inspect the content of the output file.
h5ls -v output.h5 | tee content_before_modification.txt
The variables which you need to remove from the config file are:
"pt"
"eta"
"energy"
"HadronConeExclTruthLabelID"
Open the config file configs/EMPFlow.json in your favourite text editor and inspect it. The desired variables are not listed directly in this file, but in one of the included config fragments. You therefore also need to open configs/fragments/pflow-variables.json. In this file, you can find the variables listed above (pflow-variables.json is a fragment included by pflow-base.json).
When you remove them, pay close attention to the ,
at the end of the lines to still have a valid json
file. Learn about the json
structure here. Save the modified file now.
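If you want to double-check that your edited file is still valid JSON, a quick check with python's built-in json module can help (a sketch, assuming you run it from a directory that contains the configs/ symlink created earlier):
import json

# json.load raises an error pointing at the offending line if the file is not valid JSON,
# e.g. because of a leftover trailing comma
with open("configs/fragments/pflow-variables.json") as config_file:
    json.load(config_file)
print("pflow-variables.json is valid JSON")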
With the modified file, run the dumper again over the tutorial sample.
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
Inspect the content of the new output file (note that the previous output.h5
file got overwritten).
h5ls -v output.h5 | tee content_after_modification.txt
Compare the two text dumps of the output file content.
diff content_before_modification.txt content_after_modification.txt
You should see that the variables which you removed from the config file are missing from the new output file.
5. Schedule Evaluation of a Neural Network when Running the Training-Dataset-Dumper#
The TDD can not only dump the content of a DxAOD file and convert it to an h5 file, it can also schedule the evaluation of neural networks provided in the ONNX format via the MultiFoldTagger CA block.
The MultiFoldTagger block is based on the MultifoldGNNTool that is implemented in Athena. It is an extension of the GNNTool to apply a k-fold trained tagger. While it is able to apply a k-fold trained tagger like GN2, it can also apply a simple neural network without folds.
For this task, you will:
- Read the documentation on how to add a tagger in ONNX format.
- Identify the path to the GN3EPCLV01 tagger trained with 417M jets using the overview page of available taggers.
- Add a MultiFoldTagger CA block for the GN3EPCLV01 tagger to the ca_blocks list in the config file configs/EMPFlow.json, which will apply the new tagger to the jet object.
- Figure out the output variables of GN3EPCLV01 and add them to the variables list.
- Run the dataset dumper with the modified config file to produce a second output h5 file.
- Inspect the output h5 file and look at its content. Are all the variables you defined for the new tagger present?
Hint: How can I schedule taggers in the config file?
In addition to the documentation given above, there is also an existing example in configs/EMPFlow.json.
Solution
Open the config file configs/EMPFlow.json
in your favourite text editor and inspect it.
At the bottom of the file (the second-to-last entry) is a block called "ca_blocks". The block itself is a list of dicts, and each of these dicts is one CA block. As you can see, there are already two MultiFoldTagger blocks scheduled.
Now that we know where to add it, we need to figure out the path and the variables of GN3EPCLV01. The path of the tagger can easily be found in the overview page of available taggers. The tagger we are looking for is this one. The full path of the tagger is shown right at the top. Keep in mind that for the final path, Pathfinder is used, which will look for the network in the GroupData/ folder. This is not so important for local tests, but if you run on the grid, you should define the path as in the example below.
Now that we have the path, we can schedule our CA block by adding the following to the list:
{
    "block": "MultifoldTagger",
    "nn_paths": ["BTagging/20250912/GN3EPCLV01/antikt4empflow/network.onnx"]
}
The tagger is now evaluated when running the TDD, but the variables are not yet stored. To store them, we need to figure out what the outputs of GN3EPCLV01 are. For this, we go back to the list of available taggers. Under the path, a small amount of metadata is shown, including the outputs of the tagger. Although it might seem easy to simply use these, some newer models have, for convenience, additional outputs that are combinations of the listed ones plus auxiliary outputs (from auxiliary tasks). To ensure we have everything, we can scroll down a bit and look at the "Full JSON Metadata" part. When opened, you will see a large JSON file with all the metadata of the model. To find the output variables, we can simply search for output_names. You will find one entry with all the outputs. For GN3EPCLV01, these are:
"output_names": [
"GN3EPCLV01_pb",
"GN3EPCLV01_pc",
"GN3EPCLV01_ps",
"GN3EPCLV01_pud",
"GN3EPCLV01_pg",
"GN3EPCLV01_ptau",
"GN3EPCLV01_ptFromTruthDressedWZJet",
"GN3EPCLV01_pbquark",
"GN3EPCLV01_pantibquark",
"GN3EPCLV01_pcquark",
"GN3EPCLV01_panticquark",
"GN3EPCLV01_pother",
"GN3EPCLV01_pu",
"GN3EPCLV01_TrackOrigin",
"GN3EPCLV01_VertexIndex",
"GN3EPCLV01_TrackType"
]
As we can see, there are more output variables than expected. While the "default" ones are easy to spot (_pb, _pc, etc.), there are also the outputs from the auxiliary tasks:
- _ptFromTruthDressedWZJet: the output of the jet p_\mathrm{T} regression auxiliary task
- _pbquark, _pantibquark, _pcquark, _panticquark, _pother: the outputs of the jet charge auxiliary task
- _TrackOrigin, _VertexIndex, _TrackType: the outputs of the per-track auxiliary tasks
While the first two are per-jet quantities and can easily be added together with the probabilities, the per-track auxiliary task outputs can't be added as easily. For that, you need the GNNAuxTaskMapper, which is a bit too much for this tutorial.
Lastly, we have the _pu output. If you compare the list in the short metadata summary to the actual output_names list, you will see that the outputs are renamed and that _pu also shows up. _pu is artificially added after the model's training to "mimic" a light-jet output. It is simply the sum of _ps, _pg, and _pud.
Now let's add our variables to the output. To do so, we add them as floats to the jet
variables:
"variables": {
"jet": {
"ints": [
"n_tracks"
],
"uints": [
"jetFoldHash",
"jetFoldHash_noHits",
"jetFoldRankHash"
],
"floats": [
"GN2Lep_pb",
"GN2Lep_pc",
"GN2Lep_pu",
"GN2Lep_ptau",
"GN2NoAux_pb",
"GN2NoAux_pc",
"GN2NoAux_pu",
"GN2NoAux_ptau",
"GN2v01_pb",
"GN2v01_pc",
"GN2v01_pu",
"GN2v01_ptau",
"GN3EPCLV01_pb",
"GN3EPCLV01_pc",
"GN3EPCLV01_ps",
"GN3EPCLV01_pud",
"GN3EPCLV01_pg",
"GN3EPCLV01_ptau",
"GN3EPCLV01_ptFromTruthDressedWZJet",
"GN3EPCLV01_pbquark",
"GN3EPCLV01_pantibquark",
"GN3EPCLV01_pcquark",
"GN3EPCLV01_panticquark",
"GN3EPCLV01_pother",
"GN3EPCLV01_pu"
]
},
"eventwise": {
"customs": [
"nJets"
],
"compressed": [
"primaryVertexToBeamDisplacementX",
"primaryVertexToBeamDisplacementY",
"primaryVertexToBeamDisplacementZ"
]
}
},
Afterwards, we can re-run the dumping with:
dump-single-btag -c ../training-dataset-dumper/configs/EMPFlow.json <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
Inspect the content of the output file and look for the output variables:
h5ls -v output.h5 | grep GN3EPCLV01
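If you want to convince yourself that _pu really is the sum of _ps, _pg, and _pud as described above, a short h5py check along these lines could be used (a sketch, assuming all GN3EPCLV01 variables were added to the config and the output file is output.h5):
import numpy as np
from h5py import File

with File("output.h5", "r") as h5file:
    jets = h5file["jets"][:]

# compare the "light" output to the sum of the strange, gluon, and light-quark probabilities
light_sum = jets["GN3EPCLV01_ps"] + jets["GN3EPCLV01_pg"] + jets["GN3EPCLV01_pud"]
print(np.allclose(jets["GN3EPCLV01_pu"], light_sum))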
6. Open a Merge Request#
The TDD is actively developed. Please have a look at the development guidelines. Discussion about the latest developments takes place via GitLab issues in the main project and in the Mattermost channel (click here to sign up for the ATLAS FTAG mattermost team). Furthermore, there is the FTAG discourse platform using threads to organise discussion topics.
The goal of this task is to make you familiar with the GitLab workflow of opening merge requests to add code changes. A prerequisite for this task is that you have created a fork of the project in the first task of the tutorial.
For this task, you will:
- Identify an issue to address with a merge request in the list of open issues.
- Create a new branch, then modify the code locally, commit your changes with a descriptive commit message, and push the new branch to your fork.
- Create a merge request of the fork to the main project using the GitLab web interface.
Hint: Where can I find more information on how to use git
and GitLab?
Several resources on using git
for version control exist. Please refer to this collection of useful resources.
Solution
We assume that you have created a fork of the project in the first task of the tutorial and will use the same names for your fork and the main project as in the solution to the first task. That is, the fork is origin
and the main project is upstream
.
Choose an issue from the list of open issues to address. We will assume for the solution that you want to improve the documentation.
Create a new branch. For the sake of this solution we assume that we call it improved_documentation
. You can of course choose a different name which provides a brief description of your planned modification.
git checkout -b improved_documentation
Pull the latest changes from the main project main branch (which is called main
).
git pull upstream main
With your favourite text editor, carry out the planned modifications and commit your changes with a descriptive commit message.
# Add your changed files (the . adds all changed files in the directory)
git add .
# Commit your changes
git commit -m "improve documentation"
Push your changes to your personal fork.
git push -u origin improved_documentation
In the text appearing in your console, you will see a link to a GitLab webpage. Follow that link to directly open a merge request.
You can also go to the webview of your fork. There should be a notification at the top of your page that asks you if you want to create a merge request. If you click this, you will be asked to define a source branch and a target branch. The source branch should be your improved_documentation
branch in your fork and the target should be main
of the actual TDD repository. Now click Compare branches and continue
and follow the instructions in the merge request template.
7. Run the Training-Dataset-Dumper on the Grid#
While the previous tasks target local development and testing configurations, the typical use-case of the TDD is to process large files on the LHC computing grid. For running the TDD on the grid, we first need a grid certificate. Please refer to this page for further information. We assume that you have the grid certificate downloaded and ready to use.
The dataset we want to process with the TDD is defined in the sample list for grid jobs FTagDumper/grid/inputs/single-btag.txt
.
We will just submit a job for the first sample in that list, which starts with mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_FTAG1.
For this task, you will:
- Set up the dataset dumper for submitting to the grid.
- Find the input file, and comment out all but the above entry.
- Schedule an available tagger by adding it to the config file used by the grid-submit single-btag submission script, see configs/EMPFlow.json.
- Commit these changes to your development branch.
- Dry run the submission process to test if everything works.
- Submit the dataset to the grid and tag it with Test.
- Check if you have successfully submitted the job using the BigPanDA website and monitor its progress.
- After the job has finished, retrieve the output h5 file from the grid using rucio.
When submitting to the grid, the dataset dumper will automatically take a snapshot of your current setup by creating a git commit and pushing a tag to the repository. Make sure that you have no uncommitted changes to your files before running the grid submission script.
Hint: Where do I find information about running dataset dumper on the grid?
If you have trouble finding the correct setup file, have a look here: Grid Dumps documentation.
Solution: Set up the dataset dumper for submitting to the grid.
First you need to set up the required software (similar to the local usage; we assume here you already built it).
source training-dataset-dumper/setup/athanalysis.sh
source build/x*/setup.sh
Now, we need to prepare the grid setup:
source training-dataset-dumper/FTagDumper/grid/setup.sh
If you are working on lxplus
, this should work instantly. If you are working on a cluster with cvmfs access, you might need to run lsetup emi
before sourcing the grid setup.
Hint: In which directory can I find the correct submission script and the text file with the sample to be submitted?
Check the training-dataset-dumper/FTagDumper/grid
folder. Look inside the file inputs/single-btag.txt
and search for the input datasets.
Hint: In which config file can I schedule an additional tagger and where?
Look in the directory with config files and search for the EMPFlow.json
file. Open it with your favourite text editor and look out for the dl2_configs
entry.
Solution: Schedule an available tagger by adding it to the config file used by the grid-submit single-btag
submission script.
You need to add the tagger as a DL2 entry. A nice explanation of how this is done is given here.
Hint: Where do I find information on dry running the grid submission?
Try to run
grid-submit -h
and check the options for the dry run and the tag option.
Solution: Dry run the submission process to test if everything works.
The dry run (without actual submission) can be started with
grid-submit -d -t Test single-btag
This will dry run the submission process without actually submitting the datasets defined in INPUT_DATASETS
.
Hint: Where do I find information on how to tag the output when submitting to the grid?
Try to run
grid-submit -h
and check the options for the tag option.
Hint: How can I just dump a small part of a sample?
Ensure rucio is set up:
lsetup rucio
Then select the number of events to dump via
grid-submit -c {config} -i {inputs} -n 10000 single-btag
This will select the number of files in each input container required to get at least 10,000 events. If each file has 20,000 events, then asking for 10,000 will run over a single file and produce 20,000 events.
Hint: I get an error message that git remote get-url is not available. What can I do about it?
The functionality git remote get-url only became available in git versions newer than 1.8.3.1.
If you work on lxplus, you can get a recent version of git
with lsetup
.
setupATLAS
lsetup git
Solution: Submit the dataset to the grid and tag it with Test.
To actually submit the dataset:
grid-submit -t Test single-btag
This will submit the samples and also create a tag of the current local version of the dataset dumper and push it to your fork.
Solution: Check if you have successfully submitted the job using the BigPanDA website and monitor its progress.
You can find your submitted job on the BigPanDA
webpage. Click at the top on My BigPanDA
and scroll down. You should be able to see your job there. If not, wait for a few seconds and click at the top right on Refresh
.
Solution: After the job has finished, retrieve the output h5
file from the grid using rucio
.
To retrieve the finalised file from rucio, we first need to set up rucio. This can be done by running:
setupATLAS
localSetupRucioClients
If you are not running on lxplus, you need a cluster with cvmfs access. In some cases you need to run lsetup emi
and voms-proxy-init -voms atlas
before running localSetupRucioClients
.
Now you need the name of the container/dataset. This can be retrieved from the BigPanDA job page. Scroll down on your BigPanDA page and click on the task name. Now scroll down a bit and you will find the Containers part with the input and the output. Normally two outputs are provided: the log files of the job (they have the ending .log) and the real output files (they have the ending _output.h5). The container/dataset name that we are searching for is the full name ending with _output.h5.
After that, you can simply download your file from the grid with rucio -v download <dataset-name>
. Please keep in mind that the files produced by the grid are stored on scratch disks and are erased after two weeks. To keep them save, you need to copy them to a local disk with the Rucio web interface. Create a rule there and add your dataset/container ID and save it on a local disk where you have write access. After submitting the rule, your dataset is copied automatically to the disk.
Bonus tasks#
Download a DxAOD File using Rucio#
The typical use-case of the training-dataset dumper is to process the large DxAOD dataset containers on the grid.
However, for local testing it is often useful to download single DxAOD files and process them locally for development.
Retrieval of files stored in the computing grid infrastructure is possible using the rucio
tool.
You can find documentation on its usage in the ATLAS software tutorial.
You can find information on recommended DxAOD MC samples in the FTAG algorithm docs MC sample page.
For this task, you will:
- Identify the dataset container name of a PHYSVAL DxAOD ttbar sample for the mc20d MC campaign.
- Download a single file from that container.
Solution
Open the website https://ftag-docs.docs.cern.ch/samples/samples/ and search for the ttbar
sample entry for the MC campaign mc20d
. Scrolling to the right, you should see an entry in the DAOD column, e.g. mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_FTAG1.e6337_s3681_r13144_p5627
(might be a different file at the time you are looking at this tutorial: in this case use the name listed on the webpage).
Note that you can only do the following task (within the scope of this tutorial) on a machine with /cvmfs
access, e.g. on CERN's lxplus
machines.
In a shell, set up the ATLAS software, set up rucio
and initialise your grid proxy.
setupATLAS
lsetup rucio
voms-proxy-init -voms atlas
Next, download one random file from the dataset container
rucio download --nrandom 1 mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_FTAG1.e6337_s3681_r13144_p5627
Inspect the Content of a DxAOD File#
Having a list of all possible input variables in a DxAOD file is often very useful.
For this task, you will dump a list of all objects and variables stored in the DxAOD to a text file.
To achieve this, you can make use of the checkFile.py
script which is provided in the AthAnalysis
and Athena
releases.
Solution
Set up the dumper using an Athena
and not an AthAnalysis
release.
source training-dataset-dumper/setup/athena.sh
Run checkFile.py
on the DxAOD file of which you want to dump the file content.
We assume you are running on the tutorial sample file downloaded in the prerequisites section.
checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root
The -d command line option prompts the script to provide a detailed dump.
It can be useful to store the dump in a text file.
checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root | tee dxaod_content.txt
It can also be useful to employ the command line tool grep to filter out certain information. Below is an example to only show variables including SV1.
checkFile.py -d <your_path_to>/ftag_tutorial/data/DAOD_FTAG1.ttbar_tutorial.root | grep SV1
Write a Plotting Script to Display Variables in H5 File#
With the output h5 files of the TDD, it is easy to create plots of variables in the python ecosystem, using h5py to process the h5 files and the packages numpy + matplotlib to create the plots.
You need to install the packages h5py, numpy, and matplotlib for this task. On your private machine you can do this using pip. On lxplus or your institute's machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101 which supports python3 and h5py.
Your task is to plot a histogram of the pt
distribution of jets in the output.h5
file created with the training-dataset dumper when processing the tutorial MC sample.
Solution
We assume you are working on lxplus
and are using LCG views to provide the required python
packages.
Set up the LCG view 101
which supports python3
and h5py
.
source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh
We assume that you still have an output file from running the dumper in a previous task, which is called output.h5 and which is located in the training-dataset-dumper directory.
Save the following content as a python script called plot_jet_pt.py
.
from h5py import File
import matplotlib.pyplot as plt
input_file = "output.h5"
with File(input_file, 'r') as h5file:
    jets = h5file['jets']
    fig, ax = plt.subplots()
    ax.hist(jets['pt'])
    ax.set_xlabel('jet pt [MeV]')
    ax.set_ylabel('Entries')
    fig.savefig('jet_pt.png')
Execute the python script to produce the plot.
python3 plot_jet_pt.py
The resulting plot is stored in jet_pt.png
. You can take a look at it using display
.
display jet_pt.png
In case that you want to improve the layout and style of the plot, consider having a look at the package mplhep
which provides the ATLAS plot style.
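As an illustration, a variation of the plotting script above using mplhep might look like the following sketch (assuming mplhep is available in your python environment, e.g. installed via pip):
import matplotlib.pyplot as plt
import mplhep as hep
from h5py import File

hep.style.use(hep.style.ATLAS)  # switch matplotlib to the ATLAS-like style sheet

with File("output.h5", "r") as h5file:
    pt = h5file["jets"]["pt"] / 1000  # convert from MeV to GeV

fig, ax = plt.subplots()
ax.hist(pt, bins=50)
ax.set_xlabel("jet pt [GeV]")
ax.set_ylabel("Entries")
fig.savefig("jet_pt_atlas_style.png")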
Manipulate h5 files#
You can manipulate h5 files using python scripts which employ h5py.
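As a starting point, a minimal sketch (not taken from the projects linked below) that slims a dumped file down to a single variable and the first few jets could look like this:
from h5py import File

# copy only the jet pt of the first 100 jets from output.h5 into a new, much smaller file
with File("output.h5", "r") as infile, File("output_slimmed.h5", "w") as outfile:
    jets = infile["jets"]
    outfile.create_dataset("jets_pt", data=jets["pt"][:100])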
Have a look at these git projects and the scripts which they host: