If you want to contribute to the development of the training dataset dumper, or push tags using the grid submit script, you should first create a fork of the repository. You can read in detail about the git forking workflow here. GitLab also provides documentation on how to create a fork here. After forking the main repository, clone your fork and add the main repository as the upstream remote:

git clone ssh://git@gitlab.cern.ch:7999/<cern_username>/training-dataset-dumper.git
git remote add upstream ssh://git@gitlab.cern.ch:7999/atlas-flavor-tagging-tools/training-dataset-dumper.git

You can use the fork to keep track of your changes and, once they are ready to be added to the main branch, propose them via a merge request. If other changes were merged into your target branch while you were developing, you may have to rebase your development branch onto upstream. More information about rebasing can be found here.
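The rebase step can be sketched as a small shell helper. This is only an illustration, not part of the package: the function name rebase_onto_upstream is hypothetical, it assumes the remote is called upstream as in the commands above, and the default target branch main is an assumption you may need to change.

```shell
# Rebase a local development branch onto the latest upstream branch.
# Usage: rebase_onto_upstream <dev-branch> [target-branch]
# Assumes a remote named "upstream" exists; the default target
# branch "main" is an assumption, pass another name if needed.
rebase_onto_upstream() {
    dev_branch="$1"
    target_branch="${2:-main}"

    # Get the latest state of the main repository.
    git fetch upstream || return 1

    # Replay the development branch's commits on top of upstream.
    git checkout "$dev_branch" || return 1
    git rebase "upstream/$target_branch"
}
```

For example, rebase_onto_upstream my-feature rebases a branch called my-feature. Because a rebase rewrites history, a branch that was already pushed to your fork then needs a forced push, for instance git push --force-with-lease origin my-feature.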

Package Layout#

The code lives under FTagDumper. All the top-level executables live in util/, whereas various private internal classes are defined in src/. CA scripts live in bin/. Meanwhile configuration files are found in configs/.

Feature / Bug Fix Workflow#

When adding features or fixing bugs, it's a good idea to let people know what you plan to work on before you start working on it. The following procedure should be followed:

  1. (Optional) Informal discussion on Mattermost to check whether the feature/bug already exists and whether it is a suitable addition.
  2. Open an issue on GitLab. This is a place to describe in more detail the feature/bug, and to work out what is necessary to change in the code.
  3. Assign a responsible person to the issue. If the person who opened the issue has the capability, by default they should assign themselves to the issue. Otherwise the maintainers will assign a responsible person.
  4. Merge request. The assigned person should work on the feature and open an MR. This will be reviewed, any follow-up issues created, and finally merged. The corresponding issue should then be closed.

Adding More Outputs#

Please note that in the interest of keeping this package from growing too large, some additions (in particular those not directly related to the dumping of information from xAOD) may have a better home elsewhere (e.g. in Athena). The best place to implement a feature can be discussed on Mattermost or in an issue.

Adding decoration algorithms directly to Athena is strongly encouraged

As mentioned elsewhere, this package should really be about dumping data, not doing complicated processing. If you need to do more than very basic processing, please consider adding the functionality to Athena directly. See also AFT-596 and #36.

There are two general steps for data flow in this package:

  • Decorators manipulate xAOD objects: they read in objects, calculate any properties of interest, and store (decorate) these properties on the same objects.
  • Writers are responsible for xAOD -> HDF5 transcription: they read xAOD objects and write the associated data to HDF5 files.

We enforce this separation so that decorators can easily be ported upstream to derivations, reconstruction, or the trigger. It also helps to keep the xAOD -> HDF5 transcription generic.

Adding a decoration#

There's an example class in src/ExampleDecoratorAlg.cxx which should make the implementation clearer. This algorithm is configured via a "block" in the JSON configuration, so you can get a better picture of how it's used by searching for it in the configuration files.

It should be easy to write out any simple (primitive type) decoration you add to a jet, the EventInfo object, or a track.

Writing out a decoration#

The writer classes, like most code in this package, are configured via a JSON file.

Each configuration file has an object called "variables", which specifies the per-jet outputs. These are also specified by type: there is one list for "floats", one for "chars", etc. We assume that the variables are stored on the BTagging object by default, but there are also "jet_int_variables" and "jet_floats" for anything on the jet itself. An "event" field specifies information that should be read off the EventInfo object.

Track-wise variables are specified within a similar structure within the "tracks" field.

Nothing is saved to output files by default: whatever you've decorated onto the b-tagging object must also be added to the output list. If a requested output isn't found on the xAOD, the code should immediately throw an exception.
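As a rough illustration of the layout described above, the sketch below writes a fragment containing only the keys named in this section to a scratch file and checks that it parses as JSON. The variable names (my_btag_float and so on) are hypothetical placeholders, and the flat placement of the keys under "variables" is an assumption. The "event" and "tracks" fields are left out because their exact layout is not spelled out here. Compare against an existing file in configs/ before relying on any of this.

```shell
# Write an illustrative fragment of a dumper configuration to a scratch file.
# All variable names below are hypothetical placeholders, and the nesting
# is an assumption based on the description in the text.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
{
    "variables": {
        "floats": ["my_btag_float"],
        "chars": ["my_btag_flag"],
        "jet_floats": ["my_jet_float"],
        "jet_int_variables": ["my_jet_count"]
    }
}
EOF

# Check that the fragment is well-formed JSON before using it anywhere.
python3 -m json.tool "$fragment" > /dev/null && echo "valid JSON: $fragment"
```

The same python3 -m json.tool check can be pointed at a configuration file you have edited by hand, which catches stray commas or missing brackets before a job is run.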

Editing Athena packages#

You can modify existing Athena packages by building them locally alongside the code in this package. You'll need to check out a local copy of the athena repository. You should do this in the root directory of the package source, i.e. alongside docs/, configs/, README.md, etc.

You can use git-fatlas:

git-fatlas-init -r 23.0
git-fatlas-add path/to/package

or use git atlas which should be accessible via lsetup git. Note that you might have to manually delete the Projects directory if you rely on git atlas.

Editing the documentation#

The documentation is provided by mkdocs, and deployed via CERN GitLab. For any larger edits to the documentation we recommend running mkdocs locally. It can be installed with pip:

pip install -r docs/requirements.txt

and then launched from the root directory of this project

mkdocs serve

This will launch a server and provide you with a local URL to view the pages.

Formatting code#

You are encouraged to format your code during development.

Formatting C++#

Formatting C++ code can be done using clang-format. You can set it up with the following command:

lsetup clang

To format a single file, simply run:

clang-format -i <file>

To format all files in a directory, you can use find and xargs. The patterns are quoted so that the shell passes them to find unexpanded:

find ./ \( -iname '*.h' -o -iname '*.hh' -o -iname '*.cxx' \) -print0 | xargs -0 clang-format -i
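Since clang-format -i rewrites files in place, it can be useful to preview which files a recursive run would touch. The helper below is a minimal sketch of that idea. The function name list_cxx_sources is hypothetical, and it only lists files matching the same three extensions, it does not format anything.

```shell
# Print every C++ source or header file under a directory, one per line.
# Usage: list_cxx_sources [directory]   (defaults to the current directory)
# The patterns are quoted so the shell passes them to find unexpanded.
list_cxx_sources() {
    find "${1:-.}" -type f \
        \( -iname '*.h' -o -iname '*.hh' -o -iname '*.cxx' \) -print
}
```

Running list_cxx_sources FTagDumper, for example, shows the files that the find and xargs command would hand to clang-format for that directory.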

Formatting Python#

Formatting Python code can be done using ruff. First install the package:

cd training-dataset-dumper
pip3 install -r requirements.txt

To format a file or directory, simply run:

python3 -m ruff format <path>

Debugging#

Sometimes you may hit an unfamiliar error, for example when switching to a more recent sample, with an unclear message such as

ERROR SG::ExcInvalidLink: Attempt to dereference invalid DataLink / ElementLink [1241842700/] (125644220

Such errors don't hint at their exact cause. In such cases, you can use the built-in debugger. To run with the debugger:

  • build with Athena
  • run your executable (e.g. dump-single-btag) with the -g or --debugger argument

This will attach a debugger. Once this has finished setting up, you can type

catch throw

This instructs the debugger to stop whenever an exception is thrown, allowing you to investigate what caused it. When the debugger stops at an exception, you can then type:

backtrace

or bt. This prints the full stack trace at the point where the exception was thrown.

Some exceptions are thrown and then caught internally by Athena; these appear to be harmless and can be ignored. For example,

[Everything getting setup]
DBReplicaSvc                                         INFO Total of 1 servers found for host dias.hpc.phys.ucl.ac.uk [ATLF ]
PoolSvc                                              INFO Successfully setup replica sorting algorithm
PoolSvc                                              INFO Setting up APR FileCatalog and Streams

Catchpoint 1 (exception thrown), 0x00007f4032bee350 in __cxa_throw () from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6
(gdb) bt
#0  0x00007f4032bee350 in __cxa_throw () from /cvmfs/sft.cern.ch/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/lib64/libstdc++.so.6
#1  0x00007f400528c3d7 in xercesc_3_2::ReaderMgr::popReader (this=this@entry=0x10883cb8)
    at /build/jenkins/workspace/lcg_release_pipeline/build/externals/XercesC-3.2.4/src/XercesC/3.2.4/src/xercesc/internal/ReaderMgr.cpp:1086
#2  0x00007f40053217c0 in xercesc_3_2::ReaderMgr::getSpaces (this=0x10883cb8, toFill=...)

If you see something like this, you can type continue until the next exception is found.

As an example, the debugger was used to find the issue in this merge request. To find the error we cared about, we had to continue twice through exceptions that looked like the one above. The backtrace of the third exception then revealed the actual cause of the error.