Data Management Tutorial

Grid Induction
NA3 Training Team, Israel

So far we've used files located on the UI to compute our task.
These were transferred to the WN via the InputSandbox and OutputSandbox jdl attributes.
This is not suitable for data files larger than 10 MB.
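
If you are not sure whether a file is small enough for the sandbox, a quick size check helps. The sketch below is only an illustration: the function name needs_grid_storage is hypothetical, and the limit mentioned above is interpreted as 10 * 1024 * 1024 bytes, which is an assumption.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when a file is too large for the sandbox.
# Assumption: the limit is interpreted as 10 * 1024 * 1024 bytes.
SANDBOX_LIMIT_BYTES=$((10 * 1024 * 1024))

needs_grid_storage() {
    # wc -c reads from stdin so that only the byte count is printed;
    # tr removes the padding some wc implementations add.
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -gt "$SANDBOX_LIMIT_BYTES" ]
}

# Example use:
#   if needs_grid_storage proteins1.fasta; then echo "store it on an SE"; fi
```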

After this tutorial you will be able to use files stored on the Grid for your computational task and store files created by your job on a Grid SE.

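In this tutorial we will run a complete example that uploads input files to an SE, runs a job that downloads them to the WN, and stores the job's output back on the Grid.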

Finding Storage Elements

Run the following command to list the accessible SEs:

% lcg-infosites --vo MyVO se


The output of the command looks like this:

Avail Space(Kb)  Used Space(Kb)  Type  SEs
----------------------------------------------------------------------
16070000000      n.a             n.a   se.phy.bg.ac.yu
88814387284934656                n.a   se001.ipp.acad.bg
163421592468738609               n.a   se03.grid.acad.bg
230000000020226331               n.a   plethon.grid.ucy.ac.cy
75518264152634624                n.a   node004.grid.auth.gr
2247308768194503424              n.a   se.hep.ntua.gr
...

lcg-infosites can also query other resources, such as CEs and the CEs close to each SE.
For more information, type lcg-infosites -help
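
If you would like to choose an SE by free space rather than by eye, the listing can be filtered with awk. This is a minimal sketch, not part of the tutorial's toolset: pick_largest_se is a hypothetical name, and it assumes the listing has four whitespace-separated columns (available space, used space, type, SE name).

```sh
#!/bin/sh
# Hypothetical helper: read an SE listing on stdin and print the name of the SE
# with the largest available space.
# Assumption: four whitespace-separated columns (avail, used, type, SE name).
# Lines whose first field is not a plain number (header, separator) are skipped.
pick_largest_se() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 4 && ($1 + 0) > max { max = $1 + 0; se = $4 }
         END { if (se != "") print se }'
}

# Example use:
#   lcg-infosites --vo MyVO se | pick_largest_se
```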

Creating a directory

Before we upload the files, each of you will create your own directory in the file catalog.
By convention, all logical file names start with /grid/MyVO/ as the base directory. However, some VOs require a deeper hierarchy.
For the gilda VO, the base directory is /grid/gilda/tutorials.
Use your user name as the name of your home directory under the base directory.
The following command creates a directory in the file catalog:

% lfc-mkdir -p /grid/MyVO/MyUserName

(e.g., lfc-mkdir -p /grid/see/assaf for the see VO, or lfc-mkdir -p /grid/gilda/tutorials/assaf for the gilda VO)
Be'er Sheva users: please see the Be'er Sheva-specific base directory.

The flags are similar to those of the regular mkdir command, so -p means create the parent directories if they do not exist.
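
Since the base directory depends on the VO, it can be handy to compute the path once and reuse it. The following sketch encodes only the two conventions stated above; lfc_home_dir is a hypothetical name, and other VOs with a deeper hierarchy would need their own branch.

```sh
#!/bin/sh
# Hypothetical helper: print the catalog home directory for a VO and a user name.
# Only the conventions named in this tutorial are encoded here.
lfc_home_dir() {
    vo=$1
    user=$2
    case "$vo" in
        gilda) echo "/grid/gilda/tutorials/$user" ;;
        *)     echo "/grid/$vo/$user" ;;
    esac
}

# Example use:
#   lfc-mkdir -p "$(lfc_home_dir see assaf)"
```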

Uploading files to the directory

The following command uploads a file (i.e., transfers it from your UI to a Storage Element) and registers it on the file catalog.

% lcg-cr -d MySEName -l lfn:/grid/MyVO/MyUserName/MyFileName --vo MyVO MySrcFilePath


MySEName = the name of the SE, chosen from the list provided by lcg-infosites.
lfn = the logical file name. It must start with lfn:/grid/MyVO/, e.g. lfn:/grid/MyVO/MyUserName/MyFileName
--vo = the Virtual Organization.
MySrcFilePath = the file you want to upload from the UI. It must be given in the form file:AbsolutePath
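
The file:AbsolutePath form is easy to get wrong when you work with relative paths. Here is a small sketch of a helper that builds it; to_file_url is a hypothetical name, not a Grid command.

```sh
#!/bin/sh
# Hypothetical helper: turn a local path into the file:AbsolutePath form.
to_file_url() {
    case "$1" in
        /*) echo "file:$1" ;;          # already absolute
        *)  echo "file:$(pwd)/$1" ;;   # relative: anchor it at the current directory
    esac
}

# Example use:
#   lcg-cr -d MySEName -l lfn:/grid/MyVO/MyUserName/MyFileName --vo MyVO "$(to_file_url MyFileName)"
```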

Let's try uploading a file from the UI to the Grid.
Create a file containing your name and save it as "MyUserName.txt" (e.g., % echo blablabla > assaf.txt).
To upload the file, type the following command:

% lcg-cr -d MySEName -l lfn:/grid/MyVO/MyUserName/MyFileName --vo MyVO file:/home/MyUserName/MyFileName


(e.g., for see VO: lcg-cr -d plethon.grid.ucy.ac.cy -l lfn:/grid/see/assaf/assaf.txt --vo see file:`pwd`/assaf.txt).

The output of the command is the guid of the uploaded file:
guid:399099e7-6333-45e9-8029-67aa6097d16a

Please download the following files to your account: proteins1.fasta, proteins2.fasta, blosum62.txt and ariadne.

We will use the lcg-cr command to place the required files on the Grid before running the job.
You can use the shell script below; replace MySEName with an SE of your choice (you can use more than one).


#!/bin/sh

# Upload the input files and register each one in the file catalog.
lcg-cr -d MySEName --vo MyVO -l lfn:/grid/MyVO/MyUserName/proteins1.fasta file:`pwd`/proteins1.fasta
lcg-cr -d MySEName --vo MyVO -l lfn:/grid/MyVO/MyUserName/proteins2.fasta file:`pwd`/proteins2.fasta
lcg-cr -d MySEName --vo MyVO -l lfn:/grid/MyVO/MyUserName/blosum62.txt file:`pwd`/blosum62.txt
# The local executable "ariadne" is registered on the Grid under the name alignment.exe.
lcg-cr -d MySEName --vo MyVO -l lfn:/grid/MyVO/MyUserName/alignment.exe file:`pwd`/ariadne
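
The same uploads can also be written as a loop that stops at the first problem. This is a sketch under a few assumptions: upload_all and the LCG_CR variable are hypothetical names introduced here, and every file keeps its local name on the Grid (so the ariadne to alignment.exe rename above would still need its own lcg-cr line).

```sh
#!/bin/sh
# Sketch: upload several files in a loop, stopping at the first failure.
# LCG_CR is a hypothetical override so the loop can be dry-run with "echo".
LCG_CR=${LCG_CR:-lcg-cr}

upload_all() {
    se=$1; vo=$2; user=$3
    shift 3
    for f in "$@"; do
        # Refuse to continue if a local file is missing.
        if [ ! -f "$f" ]; then
            echo "missing local file: $f" >&2
            return 1
        fi
        $LCG_CR -d "$se" --vo "$vo" -l "lfn:/grid/$vo/$user/$f" "file:$(pwd)/$f" || return 1
    done
}

# Example use:
#   upload_all MySEName MyVO MyUserName proteins1.fasta proteins2.fasta blosum62.txt
```

Setting LCG_CR=echo before calling upload_all prints the arguments of each command instead of running it, which is a cheap way to check the lfns before anything is transferred.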


Browsing your directory

If you want to view the content of your directory, you can use

% lfc-ls -l /grid/MyVO/MyUserName


You can always get the guid from an lfn or a surl, and vice versa.
For example, the lcg-lg (list GUID) command returns the guid associated with a specified lfn or surl.
Use the lfn you gave to the file you uploaded to the SE to retrieve its guid.

% lcg-lg --vo MyVO lfn:/grid/MyVO/MyUserName/MyFileName


(e.g. for see VO, lcg-lg --vo see lfn:/grid/see/assaf/assaf.txt).

The output of the command is:

guid:399099e7-6333-45e9-8029-67aa6097d16a

Tip: If you don't want to type the entire path of the lfn every time, you can set the following environment variable to your base directory:

% export LFC_HOME=/grid/MyVO/MyUserName


The value of this environment variable is automatically prepended to the relative directory and lfn paths in all commands that use them.
For example, lfc-ls will give you the contents of this directory.
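
To picture what happens to a path once LFC_HOME is set, here is a small illustration. It only mimics the idea in plain shell: resolve_lfc_path is a hypothetical name, and the real resolution happens inside the catalog client tools, whose exact rules may differ.

```sh
#!/bin/sh
# Illustration only: mimic how a relative catalog path is combined with LFC_HOME.
# resolve_lfc_path is a hypothetical name, not a Grid command.
resolve_lfc_path() {
    case "$1" in
        /*) echo "$1" ;;             # absolute paths are left untouched
        *)  echo "$LFC_HOME/$1" ;;   # relative paths are placed under LFC_HOME
    esac
}

# Example use:
#   export LFC_HOME=/grid/MyVO/MyUserName
#   resolve_lfc_path MyFileName
```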

Preparing the jdl

Now let's prepare the jdl for the job we are going to run.
Create the following jdl file, alignment.jdl, in your home directory, replacing the placeholders with the correct VO, SE and file lfns.

Type="Job";
JobType="Normal";
VirtualOrganisation="MyVO";
Executable="/bin/sh";
Arguments="alignment.sh alignment.exe proteins1.fasta proteins2.fasta blosum62.txt my_alignment";
StdOutput="alignment.out";
StdError="alignment.err";
InputSandbox={"alignment.sh"};
OutputSandbox={"alignment.err", "alignment.out"};
RetryCount=3;
ShallowRetryCount=7;
DataRequirements = {
        [
        DataCatalogType = "DLI";
        InputData = {"lfn:/grid/MyVO/MyUserName/proteins1.fasta",
                "lfn:/grid/MyVO/MyUserName/proteins2.fasta",
                "lfn:/grid/MyVO/MyUserName/alignment.exe",
                "lfn:/grid/MyVO/MyUserName/blosum62.txt"
                };
        ]
};
DataAccessProtocol = {"rfio","gsiftp","gsidcap","https"};
OutputSE="<SE_name>";


As you can see, it's a normal job, running the shell script alignment.sh with the relevant attributes. The shell script is described below.
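
Replacing the placeholders by hand is error-prone, so you may prefer to let sed do it. The sketch below is one possible way, not the tutorial's required method: fill_placeholders is a hypothetical name, and it assumes your values contain none of the characters that are special to sed here ("|", "&" and backslash).

```sh
#!/bin/sh
# Hypothetical helper: replace the tutorial placeholders in a template read from stdin.
# Assumption: the values contain no "|", "&" or backslash characters.
fill_placeholders() {
    vo=$1; user=$2; se=$3
    sed -e "s|MyVO|$vo|g" \
        -e "s|MyUserName|$user|g" \
        -e "s|<SE_name>|$se|g" \
        -e "s|MySEName|$se|g"
}

# Example use (alignment.jdl.template is a hypothetical file name):
#   fill_placeholders see assaf plethon.grid.ucy.ac.cy < alignment.jdl.template > alignment.jdl
```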

Now, let's prepare the shell script alignment.sh that the jdl runs.
The script:
  1. Downloads the required files from the Grid (by using the lfn) to the WN (where the computation itself takes place).
  2. Runs the alignment.exe using the protein sequences in the files proteins1.fasta and proteins2.fasta and using blosum62.txt.
  3. When the program terminates, the output file is uploaded to the SE and registered in the file catalog (using the lcg-cr command).


Note that the environment variables LCG_GFAL_INFOSYS and LFC_HOST are VO-dependent.
Use these commands to query them:

% lcg-infosites --vo MyVO tag
% lcg-infosites --vo MyVO lfc

The first value in the output of the lcg-infosites --vo MyVO tag command is the requested value.
You can verify that they match the environment variables that are already set:

% echo $LCG_GFAL_INFOSYS
% echo $LFC_HOST
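
The comparison can also be scripted. The helper below is a sketch: extract_bdii is a hypothetical name, and it assumes that the first output line of the tag query has the form "valor del bdii: host:port", which is the prefix that the alignment.sh script in this tutorial strips with sed.

```sh
#!/bin/sh
# Hypothetical helper: read the output of "lcg-infosites --vo MyVO tag" on stdin
# and print the BDII address found on the first line.
# Assumption: the first line looks like "valor del bdii: host:port".
extract_bdii() {
    head -n 1 | sed 's/valor del bdii: //'
}

# Example use:
#   [ "$(lcg-infosites --vo MyVO tag | extract_bdii)" = "$LCG_GFAL_INFOSYS" ] && echo "values match"
```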


The script below obtains these values at run time, on the CE where the job runs.
Alternatively, you can specify the values directly, as follows.
For the gilda VO, these are:
export LCG_GFAL_INFOSYS="glite-rb.ct.infn.it:2170";
export LFC_HOST="lfc-gilda.ct.infn.it";
For the see VO, these are:
export LCG_GFAL_INFOSYS="bdii.isabella.grnet.gr:2170";
export LFC_HOST="lfc.isabella.grnet.gr";

#!/bin/sh

##@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@##
## Description:
## this program executes a local alignment algorithm on the allocated WN.
##
## input parameters:
## $1 = name of exe file
## $2 = name of first alignment file (in fasta format)
## $3 = name of second alignment file (in fasta format)
## $4 = name of substitution matrix (blosum62)
## $5 = name of output file to which the alignment will be flushed
##@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@##

# read the input parameters
localAlignExe=$1;
proteins1=$2;
proteins2=$3;
blosum62=$4;
output_file=$5;

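# ensuring that the correct information system and catalog service are used.
# the sed call strips the "valor del bdii: " prefix from the first output line, leaving only the BDII address.
export LCG_GFAL_INFOSYS=`lcg-infosites --vo MyVO tag | head -1 | sed 's/valor del bdii: //g'`;
export LFC_HOST=`lcg-infosites --vo MyVO lfc`;
export LCG_CATALOG_TYPE="lfc";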

# creating an empty file on the WN.
touch $output_file;

# downloading the files from SE to the WN.
lcg-cp --vo MyVO lfn:/grid/MyVO/MyUserName/$proteins1 file:`pwd`/$proteins1;
lcg-cp --vo MyVO lfn:/grid/MyVO/MyUserName/$proteins2 file:`pwd`/$proteins2;
lcg-cp --vo MyVO lfn:/grid/MyVO/MyUserName/$blosum62 file:`pwd`/$blosum62;
lcg-cp --vo MyVO lfn:/grid/MyVO/MyUserName/$localAlignExe file:`pwd`/$localAlignExe;

chmod 755 $localAlignExe;

# running the executable.
./$localAlignExe -mode ss2ss -seq $proteins1 -seq2 $proteins2 -matrix $blosum62 -A 11 -B 1 -ethresh 1.0e3 -dbsize 1 -align >> $output_file;

# uploading the output to the SE.
lcg-cr --vo MyVO -d MySEName -l lfn:/grid/MyVO/MyUserName/$output_file file:`pwd`/$output_file;


if [ $? -eq 0 ]
then
echo "the output file was successfully copied to the SE"
else
echo "unable to copy output file to the SE"
fi
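
The script above checks the exit status only for the final upload. If you want the job to stop as soon as a download fails, the lcg-cp calls could be wrapped as in the sketch below. The names fetch_inputs and LCG_CP are hypothetical and introduced only for this illustration.

```sh
#!/bin/sh
# Sketch: download each input file from the Grid and stop at the first failure.
# LCG_CP is a hypothetical override so the function can be dry-run with "echo".
LCG_CP=${LCG_CP:-lcg-cp}

fetch_inputs() {
    vo=$1; user=$2
    shift 2
    for name in "$@"; do
        if ! $LCG_CP --vo "$vo" "lfn:/grid/$vo/$user/$name" "file:$(pwd)/$name"; then
            echo "unable to download $name" >&2
            return 1
        fi
    done
}

# Example use inside alignment.sh:
#   fetch_inputs MyVO MyUserName "$proteins1" "$proteins2" "$blosum62" "$localAlignExe" || exit 1
```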



Running the jdl

Now (finally) we are prepared to run the job by using the glite-wms-job-submit command.

% glite-wms-job-submit -a -o MyUserName.id alignment.jdl


As we've done previously, use the glite-wms-job-status command to check the progress of the job:

% glite-wms-job-status -i MyUserName.id
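
If you prefer not to re-run the status command by hand, its output can be tested in a loop. This is a rough sketch: job_is_done is a hypothetical name, and it assumes the status output contains a line such as "Current Status: Done (Success)".

```sh
#!/bin/sh
# Hypothetical helper: read glite-wms-job-status output on stdin and succeed
# when it reports a finished job.
# Assumption: the output contains a line such as "Current Status: Done (Success)".
job_is_done() {
    grep -q 'Current Status:[[:space:]]*Done'
}

# Example use (the 60-second interval is an arbitrary choice):
#   until glite-wms-job-status -i MyUserName.id | job_is_done; do sleep 60; done
```

Note that such a loop never ends if the job is aborted or cancelled, so keep an eye on it or add your own exit condition.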


After the job terminates (i.e., Current Status: Done (Success)), check the contents of your directory to make sure that the output file is there by using:

% lfc-ls /grid/MyVO/MyUserName


Let's retrieve the output of our job (the one that was sent via the OutputSandbox) by using the following command:

% glite-wms-job-output --dir <directory_name> -i MyUserName.id


Assuming alignment.out contains the string "the output file was successfully copied to the SE", let's retrieve the output file that was stored on the Grid:

% lcg-cp --vo MyVO lfn:/grid/MyVO/MyUserName/MyFileName file:/home/MyUserName/MyFileName


The output file name is my_alignment, as set by the last argument in the jdl.

Do you remember how to remove the files and the directory you've created?
For files, use:

% lcg-del -a -v --vo MyVO lfn:/grid/MyVO/MyUserName/MyFileName

For the empty directory, use:

% lfc-rm -r /grid/MyVO/MyUserName
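
With several files to remove, the lcg-del call can be repeated in a loop before the directory itself is removed. The sketch below is one way to do it; delete_all and LCG_DEL are hypothetical names introduced for this example.

```sh
#!/bin/sh
# Sketch: delete all replicas of several catalog entries, stopping at the first failure.
# LCG_DEL is a hypothetical override so the loop can be dry-run with "echo".
LCG_DEL=${LCG_DEL:-lcg-del}

delete_all() {
    vo=$1; user=$2
    shift 2
    for name in "$@"; do
        $LCG_DEL -a -v --vo "$vo" "lfn:/grid/$vo/$user/$name" || return 1
    done
}

# Example use:
#   delete_all MyVO MyUserName proteins1.fasta proteins2.fasta blosum62.txt alignment.exe my_alignment
```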

Congratulations. This is essentially all you need to know to use the Grid.

If you are in the mood, let's learn about Advanced topics.