
Athena FTAG algorithms tutorial#

Introduction#

In this tutorial you will learn to use the training-dataset-dumper together with Athena tools. The training-dataset-dumper is an essential tool for extracting information from data files in the ATLAS event data model. In addition, it allows for re-running Athena algorithms with local modifications and storing the output of these modified algorithms.

In case you are not familiar with running the training-dataset-dumper on its own, please refer to the training-dataset-dumper tutorial.

In the first tutorial on basic usage of the training-dataset-dumper, you used the dumper with the AnalysisBase release and the dump-single-btag executable. In this tutorial, we will use the Athena release, which in principle provides the full functionality of modifying all ATLAS tools and algorithms locally. As a consequence of using the training-dataset-dumper with Athena, you will use the ComponentAccumulator-based configuration provided by the ca-dump-single-btag job option, which provides similar functionality.

You can learn more about the Run-3-Athena configuration in the CA configuration documentation.

Another important concept covered in this tutorial is how you can retrieve Athena packages in an economical way and modify them locally. General information on the Athena git workflow is provided in the ATLAS git workflow tutorial. We will be using a convenience wrapper script provided by Dan Guest that simplifies the checkout of packages.

You can learn about it in the Git Fatlas GitHub project.

The structure of the tutorial is similar to that covering the basic usage of the training-dataset-dumper. You will be given a list of tasks in which you will learn the essential steps to modify Athena tools locally and store the modified output in h5 files.

In this tutorial you will learn how to:

  1. Clone and install the training-dataset-dumper using the version of the Athena software indicated in setup-athena.sh.
  2. Run a test job and inspect the output.
  3. Dump an h5 ntuple using the version of the Athena software indicated in setup-athena.sh.
  4. Check out a local version of an Athena tool, modify it, and compile the dumper to use the modified tool.
  5. Plot the changes with a python script based on either matplotlib or root.
  6. Inspect a merge request with changes to an Athena tool and compile the dumper using the modified tool.

The tutorial is meant to be followed in a self-guided manner. You will be prompted to carry out certain tasks by being told what the desired outcome is, without being told how to achieve it. Using the documentation of the training-dataset-dumper, you can find out how to achieve your goal. In case you are stuck, you can click on the "hint" toggle box to get a hint. If you have tried a problem for more than 10 minutes, feel free to also toggle the solution with a worked example.

In case you encounter errors, please reach out on the training-dataset-dumper Mattermost channel (click here to sign up) and open a merge request to fix the tutorial.

Prerequisites#

We assume that you already followed the first tutorial on the training-dataset-dumper and its usage. Please use the same environment for this tutorial as for the first one. This can either be a workstation with cvmfs access, such as lxplus, or your local computer using a Docker container.

Please refer to the prerequisites section in the first tutorial on using the training-dataset-dumper for instructions on how to prepare for this tutorial.

Tutorial tasks#

1. Clone and install the training-dataset-dumper using the Athena release#

Before you can start with the other tasks, you need to do this one first. The expected outcome of this task is that you will have

  1. cloned the training-dataset-dumper to your work area on your machine using git,
  2. set up a development branch for the tutorial called my_tutorial_branch,
  3. set up the analysis release corresponding to the version of the Athena software indicated in setup-athena.sh,
  4. successfully compiled it and set up the paths to be able to use it.

Go to the GitLab project page of the training-dataset-dumper to begin with the task: https://gitlab.cern.ch/atlas-flavor-tagging-tools/training-dataset-dumper/ and have a look at the setup scripts which are provided. While setup-athena.sh sets up the recommended release version of the Athena software to work with the training-dataset-dumper, the setup script setup-athena-latest.sh sets up the latest version of the master branch. For our studies in the tutorial, we will use the former.

Note that for production of larger datasets, you should not use the "latest" version because the Athena nightly builds get deleted after some time, so you might end up with a setup which is not reproducible for the datasets you dumped.

Hint: how can I clone and compile the project?

In case you are stuck and don't know what to do to retrieve the project code using git clone and how to compile it, have a look at the training-dataset-dumper's documentation on its advanced usage.

Hint: how can I create a new branch?

You can create a new branch and change to it with git using the following command:

git checkout -b my_tutorial_branch

Hint: what should I do differently to set up the Athena release compared to setting up AnalysisBase?

Make sure that you are reading the documentation on the advanced usage of the dataset dumper and not the basic installation instructions.

The important difference in the setup is that you need to

source training-dataset-dumper/setup-athena.sh

before building the project.

Solution

You need to clone the project using git. Open a fresh terminal on your workstation and create a new folder for the tutorial

mkdir tdd-athena && cd tdd-athena

Then proceed to check out the main project. Remember that for development you should always work in your personal fork of the project and not the main project!

git clone ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git

Now, set up the Athena analysis release and compile the project.

source training-dataset-dumper/setup-athena.sh
mkdir build
cd build
cmake ../training-dataset-dumper
make
source x*/setup.sh
cd ..

As a result, you now have checked out and compiled your working copy of the training-dataset-dumper using the Athena analysis release. Congratulations!

Now, set up a development branch for the tutorial.

cd training-dataset-dumper
git checkout -b my_tutorial_branch

2. Run a test job and inspect the output#

After successfully compiling the training-dataset-dumper code and finishing the setup, you will be in a position to run a test job using the package.

For this task, you will:

  1. Run a test job using the package test script.
  2. Inspect the output of the test job.

Hint: how can I run a test job?

The documentation includes instructions for running your first test job. Make sure to use the -h argument to gain an understanding of the different command line options. As explained in the documentation on the advanced usage of the training-dataset-dumper, you need to enter

test-dumper ca

to run the test job for the component-accumulator-based setup.

Hint: I can't find the output of the test job?

Make sure to read the relevant section in the documentation carefully, and use the -h argument to see if you can find an argument for the test script that will specify the output location for the test job output.

Hint: how can I inspect the job's output?

Again, you can find useful information in the documentation. h5ls is a good tool for getting started, and is included with your installation of the TDD package. Try using the -h argument to find out how you can use the tool. You can use h5diff to compare the output of two different jobs.
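
If you prefer Python, the same kind of inspection can be done with h5py, which is also used later in this tutorial. The sketch below is self-contained: it first creates a tiny file mimicking the dumper layout (a structured "jets" dataset with illustrative field names, not the full dumper output) and then lists its contents the way you would list testjob/output.h5.

```python
import numpy as np
from h5py import File

# Build a toy file mimicking the dumper layout: a structured "jets" dataset.
# The field names here are illustrative, not the real dumper output schema.
jets = np.zeros(3, dtype=[("pt", "f4"), ("SV1_N2Tpair", "i4")])
with File("example.h5", "w") as f:
    f.create_dataset("jets", data=jets)

# Inspect the file the way you would inspect testjob/output.h5.
with File("example.h5", "r") as f:
    for name, dataset in f.items():
        print(name, dataset.shape, dataset.dtype.names)
        # prints: jets (3,) ('pt', 'SV1_N2Tpair')
```

Replacing "example.h5" with the path to your real test job output prints the datasets and fields your job actually produced.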

Solution

If you have not already set up an analysis release and the paths for the dataset-dumper, do so now:

source training-dataset-dumper/setup-athena.sh
source build/x*/setup.sh
mkdir run
cd run

As specified in the documentation on the advanced usage, you can use the test-dumper command to run a test job. This command takes one mandatory argument which specifies the input configuration for the test job. We will use the ca configuration which launches the test job for the component-accumulator-based setup.

You need to use the -d optional argument to place the test job output in your working directory. So, after running

test-dumper -d testjob ca

you should have a file named testjob/output.h5.

Next, run

h5ls -v testjob/output.h5

to list the contents of the job output. The -v argument produces more verbose output. Use -h to take a look at the other available arguments.

3. Dump an h5 ntuple using the Athena release#

After successfully running the test job, you are asked to dump an h5 ntuple using the training-dataset-dumper compiled against the release version of the Athena software indicated in setup-athena.sh. Use the baseline configuration file (EMPFlow.json), which fills the output with the variables you will need for this tutorial.

For this task you will:

  1. Run the ca-dump-single-btag script to produce an h5 ntuple with 1000 events.
  2. Give the resulting output ntuple a name you will remember (e.g. output_vanilla_athena.h5), and dump it in an ad hoc folder called <as_you_like>.

We will compare the resulting ntuple with those obtained during the following part of the tutorial.

Hint: how can I produce an h5 ntuple using the component-accumulator-based setup with the training-dataset-dumper and Athena?

More information can be found in the "running" section of the advanced usage section of the TDD documentation.

Hint: how do I set up the required release version of the Athena software?

You should have done that in the first task of this tutorial. Please have a look!

Hint: Where do I find the tutorial sample the task is referring to?

The prerequisites section of the first training-dataset-dumper tutorial page explains how to download the DAOD file with 1000 ttbar events which is used in this tutorial. It is hosted at https://umami-ci-provider.web.cern.ch/tutorial/DAOD_PHYSVAL.ttbar_tutorial.root.

Assuming you run on lxplus, we suggest that you download it to EOS using wget. You can retrieve it with the following commands:

cd /eos/user/${USER:0:1}/${USER}/
mkdir -p ftag_tutorial/data && cd ftag_tutorial/data
wget https://umami-ci-provider.web.cern.ch/tutorial/DAOD_PHYSVAL.ttbar_tutorial.root

Hint: how can I choose the ntuple's name?

Run ca-dump-single-btag -h for an explanation of the arguments you can choose.

Hint: how can I limit the number of events dumped in the output ntuple?

Run ca-dump-single-btag -h for an explanation of the arguments you can choose.

Solution

Move to the tdd-athena top-level directory and set up the version of the Athena software indicated in setup-athena.sh.

cd tdd-athena
source training-dataset-dumper/setup-athena.sh

Compile the dataset-dumper against the version of the Athena software indicated in setup-athena.sh.

cd build
cmake ../training-dataset-dumper
make -j4
source x*/setup.sh
cd ..

Now, create the run directory and run over the tutorial file.

mkdir run
cd run

ca-dump-single-btag -c ../training-dataset-dumper/configs/single-b-tag/EMPFlow.json -m 1000 -o <as_you_like>/output_vanilla_athena.h5 /eos/user/${USER:0:1}/${USER}/ftag_tutorial/data/DAOD_PHYSVAL.ttbar_tutorial.root

4. Check out a local version of an Athena tool, modify it, and compile the dumper to use the modified tool#

This task is the most important part of the tutorial and illustrates an important use-case for going beyond the standard usage of the training-dataset-dumper. You are asked to locally modify an Athena package and compile the training-dataset-dumper against it, then dump a file with the modified setup and evaluate how the changes made to the Athena package propagate to the content of the dumped h5 file.

As an example, we will look into changes to the InDetVKalVxInJetTool which searches for secondary vertices inside jets with the VKalVrt vertex reconstruction package.

We will modify the value of CutBVrtScore, which defines the B vertex selection cut on the probability-like two-track vertex score based on track classification, and evaluate the effect on the output quantities of the reconstructed SV1 vertices, such as SV1_masssvx.
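
To build some intuition for the effect of this change before touching any Athena code, here is a toy sketch (illustrative only, NOT the Athena implementation): two-track vertex candidates with a probability-like score above the cut are kept, so lowering the cut from 0.015 to 0.001 accepts more candidates.

```python
# Toy illustration (NOT the Athena code): candidates pass if their
# probability-like score exceeds the cut, so a looser cut keeps more of them.
scores = [0.0005, 0.002, 0.01, 0.02, 0.3]  # made-up two-track vertex scores

def n_selected(scores, cut):
    """Count two-track vertex candidates passing the score cut."""
    return sum(score > cut for score in scores)

print(n_selected(scores, cut=0.015))  # default cut: 2 candidates pass
print(n_selected(scores, cut=0.001))  # loosened cut: 4 candidates pass
```

More accepted two-track vertices feed into the secondary-vertex finding, which is why quantities like SV1_N2Tpair and SV1_masssvx shift when the cut is loosened.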

For this task, you will:

  1. Check out the Athena tool InnerDetector/InDetRecTools/InDetVKalVxInJetTool using git-fatlas.
  2. Modify the value of the tool's property CutBVrtScore by setting the variable m_cutBVrtScore to 0.001 (default: 0.015).
  3. Modify the code to print an ATH_MSG_INFO with a sentence you like (e.g. "I love watching b-hadrons fly") when InDetVKalVxInJetTool is initialized, to provide visual feedback that you are indeed running with the modified version.
  4. Compile the training-dataset-dumper against your modified version of InDetVKalVxInJetTool.
  5. Produce an h5 ntuple with 1000 events. Use the configuration defined in training-dataset-dumper/configs/single-b-tag/EMPFlow.json. Give the ntuple a name you will remember (e.g. output_athena_with_my_changes.h5). Make sure you save the log of the execution in a text file.
  6. Inspect the log to see if your ATH_MSG_INFO has been printed (this means that everything went according to plan!)

Hint: how do I check out a single Athena package?

The related section in the training-dataset-dumper documentation describes how to check out a local copy of a single Athena package.

Hint: where can I find the InDetVKalVxInJetTool in the ATLAS athena code?

You can use the LXR code browser to search for package names. Then you can inspect them using the gitlab ATLAS athena project. For instance, the InDetVKalVxInJetTool is located in InnerDetector/InDetRecTools/InDetVKalVxInJetTool.

Hint: how can I set up git-fatlas?

We recommend checking out individual Athena packages using Dan Guest's convenience tool git-fatlas. To make it available on your machine, follow these instructions:

mkdir -p ~/utils && cd ~/utils
git clone git@github.com:dguest/git-fatlas.git
source git-fatlas/git-fatlas.sh

You could consider adding a line to your ~/.bashrc to automatically set up git-fatlas whenever you open a new shell.

Hint: how do I check out InDetVKalVxInJetTool using git-fatlas?

Assuming you have already set up git-fatlas, you should follow these instructions to check out InDetVKalVxInJetTool.

First, initialise the athena directory inside the tdd-athena/training-dataset-dumper directory.

cd tdd-athena/training-dataset-dumper
git-fatlas-init -r master
ls

You should observe that a new directory athena has been created. Change to that directory and check out the InDetVKalVxInJetTool package.

cd athena
git-fatlas-add InnerDetector/InDetRecTools/InDetVKalVxInJetTool
cd ../..

Now you can compile the training-dataset-dumper similarly to the previous step. The local package will be used instead of the one in the Athena release.

Hint: how do I modify InDetVKalVxInJetTool to print out a message during initialisation?

Assuming that you have already checked out the package locally, open tdd-athena/training-dataset-dumper/athena/InnerDetector/InDetRecTools/InDetVKalVxInJetTool/src/InDetVKalVxInJetTool.cxx in your favourite text editor.

With your text editor, move to the StatusCode InDetVKalVxInJetTool::initialize() function and add a statement to print out a message.

Below, we illustrate the desired modification.

Before changes:

[...]
StatusCode InDetVKalVxInJetTool::initialize(){
 ATH_MSG_DEBUG("InDetVKalVxInJetTool initialize() called");
 try{ m_compatibilityGraph = new boost::adjacency_list<boost::listS, boost::vecS, boost::undirectedS>();}
[...]

After changes:

[...]
StatusCode InDetVKalVxInJetTool::initialize(){
 ATH_MSG_DEBUG("InDetVKalVxInJetTool initialize() called");
 // new message added for the tutorial
 ATH_MSG_INFO("I love watching b-hadrons fly.");
 // end of modification
 try{ m_compatibilityGraph = new boost::adjacency_list<boost::listS, boost::vecS, boost::undirectedS>();}
[...]
Hint: how do I modify InDetVKalVxInJetTool to use a different value for CutBVrtScore?

Assuming that you have already checked out the package locally, open tdd-athena/training-dataset-dumper/athena/InnerDetector/InDetRecTools/InDetVKalVxInJetTool/src/InDetVKalVxInJetTool.cxx in your favourite text editor.

With your text editor, move to the constructor InDetVKalVxInJetTool::InDetVKalVxInJetTool and modify the default value of m_cutBVrtScore.

Below, we illustrate the desired modification.

Before changes:

[...]
m_zTrkErrorCut(5.0),
m_cutBVrtScore(0.015),
m_vrt2TrMassLimit(4000.),
[...]

After changes:

[...]
m_zTrkErrorCut(5.0),
m_cutBVrtScore(0.001),
m_vrt2TrMassLimit(4000.),
[...]
Hint: what executable should I use when dumping ntuples which pick up changes in locally modified packages?

Instead of ca-dump-single-btag, make sure that you use ca-dump-retag for dumping the ntuple.

Solution

We will use the git-fatlas tool to check out InnerDetector/InDetRecTools/InDetVKalVxInJetTool.

If you have not already set up git-fatlas, follow these instructions.

mkdir -p ~/utils && cd ~/utils
git clone https://github.com/dguest/git-fatlas.git
source git-fatlas/git-fatlas.sh
cd -

Using git-fatlas, check out InnerDetector/InDetRecTools/InDetVKalVxInJetTool:

cd training-dataset-dumper
git-fatlas-init -r master
cd athena
git-fatlas-add InnerDetector/InDetRecTools/InDetVKalVxInJetTool
cd ../..

Modify the source code of InDetVKalVxInJetTool using your favourite text editor. Open tdd-athena/training-dataset-dumper/athena/InnerDetector/InDetRecTools/InDetVKalVxInJetTool/src/InDetVKalVxInJetTool.cxx and make two modifications:

  • add a message during initialisation of the tool so that you can verify in the output when running the dumper that your modified version is used. You will find the InDetVKalVxInJetTool::initialize() function at line 167. The message could be ATH_MSG_INFO("I love watching b-hadrons fly.");.
  • modify the value of m_cutBVrtScore from 0.015 to 0.001. You will find it at line 49.

Now, you want to rebuild the code with the modifications you just made. Start by deleting your build folder to avoid conflicting setups, and then recompile the code as you did at the beginning of the tutorial.

rm -rf build
mkdir build
cd build
cmake ../training-dataset-dumper
make -j4
source x*/setup.sh
cd ..

Unfortunately, a non-negligible amount of time will be needed to build the code. Just be patient!

Once the compilation has finished, (re-)create a run folder and launch the dumper again. Pay close attention to the print-outs to see if your message got printed out!

mkdir -p run && cd run
ca-dump-retag -c ../training-dataset-dumper/configs/single-b-tag/EMPFlow.json -m 1000 -o <as_you_like>/output_athena_with_my_changes.h5 /eos/user/${USER:0:1}/${USER}/ftag_tutorial/data/DAOD_PHYSVAL.ttbar_tutorial.root |& tee log_output_athena_with_my_changes.txt 

grep "I love watching b-hadrons fly." log_output_athena_with_my_changes.txt

5. Plot the changes with a python script based on either matplotlib or root#

With the output h5 files of the training-dataset-dumper, it is easy to create plots of variables in the python ecosystem using h5py to process the h5 files. You can create the plots using either numpy + matplotlib or pyROOT (among many other options not discussed here).
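
The basic access pattern is the same regardless of the plotting backend: open the file with h5py, select the jets dataset, and read a named field into a numpy array. The sketch below is self-contained and uses a synthetic file with made-up values, since a real dumper output has a much longer field list:

```python
import numpy as np
from h5py import File

# Create a synthetic file with the dumper's structured "jets" layout.
# The values are random stand-ins, only the access pattern matters here.
rng = np.random.default_rng(0)
jets = np.zeros(1000, dtype=[("SV1_N2Tpair", "i4")])
jets["SV1_N2Tpair"] = rng.poisson(3, size=1000)
with File("toy_output.h5", "w") as f:
    f.create_dataset("jets", data=jets)

# Read the variable back as a plain numpy array, ready for histogramming.
with File("toy_output.h5", "r") as f:
    n2t = np.asarray(f["jets"]["SV1_N2Tpair"])

counts, edges = np.histogram(n2t, bins=10, range=(0, 10))
print(len(n2t))  # prints: 1000
```

For the task below you would point File() at your real output files and pass the resulting arrays to matplotlib or fill them into a ROOT histogram.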

Your task is to plot the distribution of the SV1_N2Tpair variable for the jets contained in the output_<version>.h5 files created with the training-dataset-dumper when processing the tutorial MC sample. This variable corresponds to the number of good two-track vertices found by the SSVF algorithm during the reconstruction process and is thus sensitive to the tuning of the algorithm itself.

When working with numpy + matplotlib, you need to install the packages h5py, numpy, and matplotlib for this task. On your private machine you can do this using pip. On lxplus or your institute's machine, you can either use a virtual environment (see here) or set up an LCG view, e.g. LCG view 101 which supports python3 and h5py.

Solution using numpy + matplotlib

We assume you are working on lxplus and are using LCG views to provide the required python packages. Set up the LCG view 101 which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

We assume that you still have the output files of running the dumper from the other tasks, which are located in the <as_you_like> directory and called output_vanilla_athena.h5, output_athena_with_my_changes.h5 and output_athena_with_mr_51253.h5, respectively.

Save the following content as a python script called plot_jet_n2t_matplotlib.py.

from h5py import File
import matplotlib.pyplot as plt

input_file_vanilla_athena = File("<as_you_like>/output_vanilla_athena.h5","r")
jets_vanilla_athena = input_file_vanilla_athena['jets']

input_file_athena_with_my_changes = File("<as_you_like>/output_athena_with_my_changes.h5","r")
jets_athena_with_my_changes = input_file_athena_with_my_changes['jets']

input_file_athena_with_mr_51253 = File("<as_you_like>/output_athena_with_mr_51253.h5","r")
jets_athena_with_mr_51253 = input_file_athena_with_mr_51253['jets']

nbins = 50
hist_range = (0, 50)
plt.hist(jets_vanilla_athena["SV1_N2Tpair"], nbins, hist_range, alpha=0.5, label="CutBVrtScore = 0.015")
plt.hist(jets_athena_with_my_changes["SV1_N2Tpair"], nbins, hist_range, alpha=0.5, label="CutBVrtScore = 0.001")
plt.hist(jets_athena_with_mr_51253["SV1_N2Tpair"], nbins, hist_range, alpha=0.5, label="CutBVrtScore = 0.005")
plt.legend()

plt.savefig("jet_n2t.png")

Execute the python script to create the plot.

python3 plot_jet_n2t_matplotlib.py

The resulting plot is stored in jet_n2t.png. You can take a look at it using display.

display jet_n2t.png

In case that you want to improve the layout and style of the plot, consider having a look at one of the packages mplhep or atlasify which provide the ATLAS plot style.

Solution using pyROOT

We assume you are working on lxplus and are using LCG views to provide the required python packages. Set up the LCG view 101 which supports python3 and h5py.

source /cvmfs/sft.cern.ch/lcg/views/LCG_101/x86_64-centos7-clang12-opt/setup.sh

We assume that you still have the output files of running the dumper from the previous tasks, which are located in the <as_you_like> directory and called output_vanilla_athena.h5, output_athena_with_my_changes.h5 and output_athena_with_mr_51253.h5 respectively.

Move to the <as_you_like> folder and save the following content as a python script called plot_jet_n2t_root.py.

from h5py import File
import numpy
import ROOT

ROOT.gStyle.SetOptStat(0)
ROOT.gStyle.SetOptFit(1)

input_files = ["output_vanilla_athena.h5", "output_athena_with_my_changes.h5", "output_athena_with_mr_51253.h5"]    

hist_n2t = ROOT.TH1F("n2t", "SSVF two-track vertexes", 21, -0.5, 20.5)
func = ROOT.TF1("exponential", "[0]*exp(-[1]*x)", 1, 20)

for input_file in input_files :

    with File(input_file, 'r') as h5file:

        jets = h5file['jets']
        arr_weight = numpy.asarray(jets['mcEventWeight'])
        arr_n2t = numpy.asarray(jets['SV1_N2Tpair'])

        hist_n2t.Reset()
        for i in range(numpy.size(arr_n2t)) :
            hist_n2t.Fill(arr_n2t[i], arr_weight[i])

    canvas = ROOT.TCanvas()
    hist_n2t.GetXaxis().SetTitle('# of 2T vtxs in jet')
    hist_n2t.GetYaxis().SetTitle('Entries')
    hist_n2t.Draw()

    func.SetParameter(0, 1000)
    hist_n2t.Fit("exponential")
    func.Draw("SAME")

    canvas.Print(input_file+'.jet_n2t.png')

Execute the python script and compare the print-outs.

python3 plot_jet_n2t_root.py

The resulting plots are stored next to the input files and named <name of the output .h5 file>.jet_n2t.png. You can take a look at them using display.

display output_<version>.h5.jet_n2t.png

6. Inspect a merge request with changes to an Athena tool and compile the dumper using the modified tool#

Sometimes, you are interested in validating the effects of an open merge request that modifies an Athena package. The training-dataset-dumper provides a convenient way of doing this. Similarly to the previous task, in which you checked out an Athena package yourself, you now want to check out an Athena package from an open merge request, compile the TDD against it, and produce an h5 ntuple to study the effects of the change.

For this task, you will:

  1. Check out Athena's InnerDetector/InDetRecTools/InDetVKalVxInJetTool from merge request !51253.
  2. Compile the TDD against the version of InDetVKalVxInJetTool in !51253.
  3. Produce an h5 ntuple with 1000 events. Use the configuration defined in training-dataset-dumper/configs/single-b-tag/EMPFlow.json. Give the ntuple a name you will remember (e.g. output_athena_with_mr_51253.h5). Make sure you save the log of the ComponentAccumulator-based script's execution in a text file.
  4. Inspect the log to find whether the phrase "FTAG TUTORIAL" appears (if you cannot find it, something did not go as expected).

Hint: how do I check out a package from a merge request open in Athena?

Use git-fatlas together with standard git commands. More details can be found in the documentation of git-fatlas.

Solution

cd training-dataset-dumper/athena

Due to your previous modifications, you may have uncommitted files in your local version of InDetVKalVxInJetTool. Before checking out the version of the package in the target merge request, you need to either stash your changes (easier if you are not interested in using them again) or commit them. Since you will not be using the local version of the code in the rest of the tutorial, this solution proceeds with git stash.

git stash
git fetch atlas merge-requests/51253/head:athena_mr_51253 && git checkout athena_mr_51253
git-fatlas-add InnerDetector/InDetRecTools/InDetVKalVxInJetTool

Again, to avoid the risk of facing unpleasant conflicts when building the code against an updated version of Athena, you may want to clean up your build folder. If you feel lucky, feel free to try building the code starting from your existing build folder, though!

cd ../..
rm -rf build
mkdir build

cd build
cmake ../training-dataset-dumper
make
source x*/setup.sh
cd ..

mkdir -p run && cd run

ca-dump-retag -c ../training-dataset-dumper/configs/single-b-tag/EMPFlow.json -m 1000 -o <as_you_like>/output_athena_with_mr_51253.h5 /eos/user/${USER:0:1}/${USER}/ftag_tutorial/data/DAOD_PHYSVAL.ttbar_tutorial.root |& tee log_output_athena_with_mr_51253.txt

Can you find the key phrase "FTAG TUTORIAL" in log_output_athena_with_mr_51253.txt now?