DOMAINALIGN documentation


 


CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES



1.0 SUMMARY

Generates DAF files (domain alignment files) of structure-based sequence alignments for nodes in a DCF file (domain classification file).


2.0 INPUTS & OUTPUTS

DOMAINALIGN reads a DCF file (domain classification file) and, for each user-defined node (e.g. family or superfamily) in the file in turn, generates a structure-based sequence alignment annotated with domain classification data (a 'domain alignment file'). If the STAMP algorithm is used, structural superimpositions are also generated and saved to file (PDB format). The alignments are calculated by using stamp or TCOFFEE, so these applications must be installed on the system that is running DOMAINALIGN (see 'Notes' below).
No alignment can be generated for a node that contains only a single entry (domain); sequences for such domains are optionally written to file (FASTA format).
DOMAINALIGN requires a directory of domain PDB files; the path and extension of these must be set by the user (via the ACD file) and also specified in the stamp "pdb.directories" file (see 'Notes' below).
A log file of diagnostic messages is written. The identifiers (e.g. SCOP Sunid) of the nodes from the DCF file are used to name the output files. The user also specifies the input file, the paths for the two types of alignment file (output), the path for singlet sequence files (if output) and the name of the log file.
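
The per-node behaviour described above can be illustrated with a minimal Python sketch. It is not DOMAINALIGN's own code: the function name, the in-memory (node identifier, domain identifier) pairs and the identifiers themselves are hypothetical stand-ins for records that would really come from a DCF file.

```python
from collections import defaultdict


def split_nodes(domains):
    """Group (node_id, domain_id) pairs by node.

    Returns two dicts: nodes with several domains (candidates for
    alignment) and nodes with exactly one domain (singlets).
    """
    by_node = defaultdict(list)
    for node_id, domain_id in domains:
        by_node[node_id].append(domain_id)
    to_align = {n: d for n, d in by_node.items() if len(d) > 1}
    singlets = {n: d[0] for n, d in by_node.items() if len(d) == 1}
    return to_align, singlets


# Hypothetical records, for illustration only
records = [("node1", "dom0001"), ("node1", "dom0002"), ("node2", "dom0003")]
to_align, singlets = split_nodes(records)
print("align:", to_align)
print("singlets:", singlets)
```

Nodes in the first dictionary would be passed to the alignment program; those in the second would only have their sequence written out, if that option is switched on.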


3.0 INPUT FILE FORMAT

The format of the domain classification file is described in scopparse.c.


4.0 OUTPUT FILE FORMAT

Structure-based sequence alignment
This file (Figure 1) is in EMBOSS "simple" multiple sequence alignment format. This is similar to the output file generated by stamp when issued with the following three types of command:
 (1) stamp -l ./stamps_file.dom -s -n 2 -slide 5 -prefix ./stamps_file -d ./stamps_file.set;
     sorttrans -f ./stamps_file.scan -s Sc 2.5 > ./stamps_file.sort;
     stamp -l ./stamps_file.sort -prefix ./stamps_file > ./stamps_file.log

 (2) poststamp -f ./stamps_file.3 -min 0.5

 (3) ver2hor -f ./stamps_file.3.post > ./stamps_file.out

The DOMAINALIGN output file (Figure 1) displays the sequence names, positions and sequences. The names are the 7 character domain identifier codes taken from the domain classification file. The positions are the start and end residue positions of the appropriate section of sequence. The sequence uses '-' as a gap character. The domain classification records for the appropriate node from the DCF file are given above the alignment. The STAMP 'Post similar' line is given as a markup line underneath the sequence but no DSSP assignments are written. All lines other than sequence lines begin with '#' to denote a comment.
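
As an illustration of how such a file might be read, the Python sketch below separates comment lines from sequence lines. It assumes, without confirmation from this document, that each sequence line holds four whitespace-separated fields (name, start position, sequence, end position) and that a long alignment is wrapped into several blocks that repeat the names. The function name and the sample text are hypothetical.

```python
def parse_daf(text):
    """Return (comments, sequences) from alignment text.

    Assumed layout of a sequence line: name start sequence end.
    Lines beginning with '#' are kept as comments.
    """
    comments = []
    sequences = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            comments.append(line)
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"unexpected sequence line: {line!r}")
        name, _start, seq, _end = fields
        # Join the pieces of a wrapped alignment by sequence name
        sequences[name] = sequences.get(name, "") + seq
    return comments, sequences


# Hypothetical sample, not real DOMAINALIGN output
sample = (
    "# hypothetical classification record\n"
    "dom0001       0 ACD-EFG      0\n"
    "dom0002       0 ACDKEF-      0\n"
    "# hypothetical markup line\n"
    "dom0001       0 HIK      0\n"
    "dom0002       0 H-K      0\n"
)
comments, sequences = parse_daf(sample)
print(sequences)
```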


5.0 DATA FILES

DOMAINALIGN does not use any data files of its own, but it relies on the stamp "pdb.directories" file, which specifies the permissible prefix, extension and path of PDB files used by STAMP. On the RFCGR system, this file is /packages/stamp/defs/pdb.directories and should look like:
 test_data/ - .dent
 /data/pdb - -
 /data/pdb _ .ent
 /data/pdb _ .pdb
 /data/pdb pdb .ent
 /data/pdbscop _ _
 /data/pdbscop _ .ent
 /data/pdbscop _ .pdb
 /data/pdbscop pdb .ent
 ./ _ _
 ./ _ .ent
 ./ _ .ent.z
 ./ _ .ent.gz
 ./ _ .pdb
 ./ _ .pdb.Z
 ./ _ .pdb.gz
 ./ pdb .ent
 ./ pdb .ent.Z
 ./ pdb .ent.gz
 /data/CASS1/pdb/coords/ _ .pdb
 /data/CASS1/pdb/coords/ _ .pdb.Z
 /data/CASS1/pdb/coords/ _ .pdb.gz
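
To make the role of the three columns concrete, the following Python sketch expands entries of this kind into the file names they would permit for a given domain identifier. It is an illustration, not part of stamp or DOMAINALIGN. It assumes the columns are path, prefix and extension in that order and that "_" means "no prefix" or "no extension"; the function names and the domain identifier are hypothetical.

```python
def parse_pdb_directories(text):
    """Parse 'path prefix extension' lines into a list of tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(tuple(fields))
    return entries


def candidate_paths(domain_id, entries):
    """List the file name that each entry would allow for domain_id."""
    paths = []
    for path, prefix, extension in entries:
        # Assumption: "_" is a placeholder for an empty field
        prefix = "" if prefix == "_" else prefix
        extension = "" if extension == "_" else extension
        paths.append(path.rstrip("/") + "/" + prefix + domain_id + extension)
    return paths


sample = "/data/pdbscop _ .ent\n./ pdb .ent.gz\n"
entries = parse_pdb_directories(sample)
for p in candidate_paths("dom0001", entries):  # hypothetical identifier
    print(p)
```

A listing like this can help when checking that the directory and extension given to DOMAINALIGN are also covered by an entry in "pdb.directories".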




6.0 USAGE

6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-dcfinfile]         infile     This option specifies the name of DCF file
                                  (domain classification file) (input). A
                                  'domain classification file' contains
                                  classification and other data for domains
                                  from SCOP or CATH, in DCF format
                                  (EMBL-like). The files are generated by
                                  using SCOPPARSE and CATHPARSE. Domain
                                  sequence information can be added to the
                                  file by using DOMAINSEQS.
  [-pdbdir]            directory  This option specifies the location of domain
                                  PDB files (input). A 'domain PDB file'
                                  contains coordinate data for a single domain
                                  from SCOP or CATH, in PDB format. The files
                                  are generated by using DOMAINER.
   -node               menu       This option specifies the node for
                                  redundancy removal. Redundancy can be
                                  removed at any specified node in the SCOP or
                                  CATH hierarchies. For example by selecting
                                  'Class' entries belonging to the same Class
                                  will be non-redundant.
   -mode               menu       This option specifies the alignment
                                  algorithm to use.
   -[no]keepsinglets   toggle     This option specifies whether to write
                                  sequences of singlet families to file. If
                                  you specify this option, the sequence for
                                  each singlet family are written to file
                                  (output).
*  -singletsoutdir     outdir     This option specifies the location of DHF
                                  files (domain hits files) for singlet
                                  sequences (output). The singlets are written
                                  out as a 'domain hits file' - which
                                  contains database hits (sequences) with
                                  domain classification information, in FASTA
                                  format.
  [-dafoutdir]         outdir     This option specifies the location of DAF
                                  files (domain alignment files) (output). A
                                  'domain alignment file' contains a sequence
                                  alignment of domains belonging to the same
                                  SCOP or CATH family. The files are in
                                  clustal format and are annotated with domain
                                  family classification information. The
                                  files generated by using SCOPALIGN will
                                  contain a structure-based sequence alignment
                                  of domains of known structure only. Such
                                  alignments can be extended with sequence
                                  relatives (of unknown structure) by using
                                  SEQALIGN.
*  -superoutdir        outdir     This option specifies the location of
                                  structural superimposition files (output). A
                                  file in PDB format of the structural
                                  superimposition is generated for each family
                                  if the STAMP algorithm is used.
   -logfile            outfile    This option specifies the name of log file
                                  (output). The log file contains messages
                                  about any errors arising while domainalign
                                  ran.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-logfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


Standard (Mandatory) qualifiers

Qualifier      : [-dcfinfile] (Parameter 1)
Description    : This option specifies the name of DCF file (domain classification file) (input). A 'domain classification file' contains classification and other data for domains from SCOP or CATH, in DCF format (EMBL-like). The files are generated by using SCOPPARSE and CATHPARSE. Domain sequence information can be added to the file by using DOMAINSEQS.
Allowed values : Input file
Default        : Required

Qualifier      : [-pdbdir] (Parameter 2)
Description    : This option specifies the location of domain PDB files (input). A 'domain PDB file' contains coordinate data for a single domain from SCOP or CATH, in PDB format. The files are generated by using DOMAINER.
Allowed values : Directory
Default        : ./

Qualifier      : -node
Description    : This option specifies the node for redundancy removal. Redundancy can be removed at any specified node in the SCOP or CATH hierarchies. For example by selecting 'Class' entries belonging to the same Class will be non-redundant.
Allowed values : 1 (Class (SCOP))
                 2 (Fold (SCOP))
                 3 (Superfamily (SCOP))
                 4 (Family (SCOP))
                 5 (Class (CATH))
                 6 (Architecture (CATH))
                 7 (Topology (CATH))
                 8 (Homologous Superfamily (CATH))
                 9 (Family (CATH))
Default        : 1

Qualifier      : -mode
Description    : This option specifies the alignment algorithm to use.
Allowed values : 1 (STAMP)
                 2 (TCOFFEE)
Default        : 1

Qualifier      : -[no]keepsinglets
Description    : This option specifies whether to write sequences of singlet families to file. If you specify this option, the sequence for each singlet family are written to file (output).
Allowed values : Toggle value Yes/No
Default        : Yes

Qualifier      : -singletsoutdir
Description    : This option specifies the location of DHF files (domain hits files) for singlet sequences (output). The singlets are written out as a 'domain hits file' - which contains database hits (sequences) with domain classification information, in FASTA format.
Allowed values : Output directory
Default        : (blank)

Qualifier      : [-dafoutdir] (Parameter 3)
Description    : This option specifies the location of DAF files (domain alignment files) (output). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family. The files are in clustal format and are annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN.
Allowed values : Output directory
Default        : ./

Qualifier      : -superoutdir
Description    : This option specifies the location of structural superimposition files (output). A file in PDB format of the structural superimposition is generated for each family if the STAMP algorithm is used.
Allowed values : Output directory
Default        : ./

Qualifier      : -logfile
Description    : This option specifies the name of log file (output). The log file contains messages about any errors arising while domainalign ran.
Allowed values : Output file
Default        : domainalign.log

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers: (none)
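
The qualifiers listed above could be combined into a non-interactive command line. The Python sketch below only assembles and prints such a command and does not run it. The qualifier names and menu numbers are taken from the listing in this section; the helper name, the file and directory names, and the choice to pass -superoutdir only in STAMP mode are illustrative assumptions.

```python
import shlex


def build_command(dcf, pdbdir, dafdir, node=4, mode=1, superdir="./",
                  singletsdir=None, logfile="domainalign.log"):
    """Assemble a domainalign argument list (nothing is executed)."""
    if node not in range(1, 10):
        raise ValueError("node must be a menu value from 1 to 9")
    if mode not in (1, 2):
        raise ValueError("mode must be 1 (STAMP) or 2 (TCOFFEE)")
    argv = ["domainalign", "-dcfinfile", dcf, "-pdbdir", pdbdir,
            "-node", str(node), "-mode", str(mode)]
    if singletsdir is None:
        argv.append("-nokeepsinglets")
    else:
        argv += ["-keepsinglets", "-singletsoutdir", singletsdir]
    argv += ["-dafoutdir", dafdir]
    if mode == 1:
        # Assumption: superimpositions are only relevant to STAMP runs
        argv += ["-superoutdir", superdir]
    argv += ["-logfile", logfile, "-auto"]
    return argv


# Hypothetical file and directory names
argv = build_command("all.scop", "/data/pdbscop", "./daf",
                     node=4, mode=1, singletsdir="./singlets")
print(shlex.join(argv))
```

Building the argument list in one place keeps the menu numbers for -node and -mode next to their validity checks, which is convenient when the same run is repeated at several levels of the hierarchy.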

6.2 EXAMPLE SESSION

An example of interactive use of DOMAINALIGN is shown below.


7.0 KNOWN BUGS & WARNINGS

1. Use of stamp
DOMAINALIGN requires a modified version of stamp (see Notes below). The modified stamp application must be installed on the system that is running DOMAINALIGN.

When running DOMAINALIGN at the RFCGR, to ensure the modified version of stamp is used, type 'use stamp2' (which runs the script /packages/menu/USE/stamp2) before DOMAINALIGN is run.

2. Strange stamp behaviour
stamp will ignore (omit from the alignment and *not* replace with '-' or any other symbol) ANY residues or groups in a PDB file that

(i) are not structured (i.e. do not appear in the ATOM records) or
(ii) lack a CA atom, regardless of whether it is a known amino acid or not.

This means that the position (column) in the alignment cannot reliably be used as the basis for an index into arrays representing the full length sequences. stamp will however include in the alignment residues with a single atom only, so long as it is the CA atom.
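
One way to recover the correspondence is to map each aligned residue back onto the full-length sequence explicitly, instead of counting alignment columns. The Python sketch below does this with a greedy left-to-right match. It is an illustration under two assumptions: the aligned residues form a subsequence of the full sequence, and the first matching position is the right one (which can be wrong when an omitted residue is followed by an identical one). The function name and sequences are hypothetical, and placeholder characters such as 'X' are not handled.

```python
def map_alignment_to_sequence(aligned_row, full_sequence, gap="-"):
    """Map each alignment column to an index in full_sequence.

    Gap columns map to None. Residues that stamp omitted from the
    alignment are simply skipped in full_sequence.
    """
    mapping = []
    pos = 0
    for ch in aligned_row:
        if ch == gap:
            mapping.append(None)
            continue
        idx = full_sequence.find(ch, pos)
        if idx == -1:
            raise ValueError(f"residue {ch!r} not found after position {pos}")
        mapping.append(idx)
        pos = idx + 1
    return mapping


full = "MKTAYIAK"      # hypothetical full-length sequence
aligned = "MKA-YIAK"   # 'T' omitted by the aligner, one gap column
print(map_alignment_to_sequence(aligned, full))
```

In this example the third alignment column corresponds to index 3 of the full sequence, not index 2, which is exactly the offset that makes column numbers unreliable as array indices.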

3. Handling of singlet nodes
No sequence alignment or structural superimposition files are generated for nodes that contain a single domain only. Sequences for such domains can be saved to file (see 2.0 INPUTS & OUTPUTS).

4. Alignment numbering
Residue positions in the alignment are not implemented (zeros are given).


8.0 NOTES

1. Adaption of STAMP for domain codes
DOMAINALIGN will only run with a version of stamp that has been modified so that PDB id codes longer than 4 characters are accepted. This involves a trivial change to the stamp module getdomain.c (around line number 155), where a 4 must be changed to a 7 as follows:
 temp=getfile(domain[0].id,dirfile,4,OUTPUT);    (original)
 temp=getfile(domain[0].id,dirfile,7,OUTPUT);    (modified)

2. Adaption of STAMP for larger datasets
STAMP fails to align a large dataset of all the available V set Ig domains. The ver2hor module generates the following error:
 Transforming coordinates...
  ...done.
 ver2hor -f ./domainalign-1022069396.11280.76.post > ./domainalign-1022069396.11280.out
 error: something wrong with STAMP file
          STAMP length is 370, Alignment length is 422
          STAMP nseq is 155, Alignment nseq is 155

This is fixed by the following change in alignfit.h:
 #define MAXtlen 200     (original)
 #define MAXtlen 2000    (modified)

At the same time the following may be changed as a safety measure:
 gstamp.c  : #define MAX_SEQ_LEN 10000    (was 2000)
 pdbseq.c  : #define MAX_SEQ_LEN 10000    (was 3000)
 defaults.h: #define MAX_SEQ_LEN 10000    (was 8000)
 defaults.h: #define MAX_NSEQ 10000       (was 1000)
 defaults.h: #define MAX_BLOC_SEQ 5000    (was 500)
 dstamp.h  : #define MAX_N_SEQ 10000      (was 1000)
 ver2hor.h : #define MAX_N_SEQ 10000      (was 1000)

The modified code (for 1. and 2. above) is kept on the HGMP file system in /packages/stamp/src2

When running DOMAINALIGN at the HGMP, it is essential that the command 'use stamp2' (which runs the script /packages/menu/USE/stamp2) is given before DOMAINALIGN is run. This ensures that the modified version of stamp is used.

3. pdb.directories file
stamp (and therefore DOMAINALIGN) uses a "pdb.directories" file: see 5.0 DATA FILES

4. Choice of alignment algorithm
Future versions of DOMAINALIGN will implement a larger choice of alignment algorithms.

5. Getting the best alignment
DOMAINALIGN will produce better alignments if the DCF file is reordered so that the representative structure of each node (e.g. family) is given first. This is achieved by using DOMAINREP.

6. Whitespace in alignment
STAMP can insert nonsensical whitespace into its alignments, e.g. in place of a residue character where that residue was missing electron density in the PDB file. DOMAINALIGN replaces each whitespace character within a STAMP alignment with an "X".
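
The substitution itself is simple. As a minimal Python sketch (an illustration of the idea with a hypothetical function name and input, not the code used by DOMAINALIGN):

```python
def fill_whitespace(row, placeholder="X"):
    """Replace every whitespace character in an alignment row."""
    return "".join(placeholder if ch.isspace() else ch for ch in row)


row = "AC D-EF"  # hypothetical alignment row containing a space
print(fill_whitespace(row))
```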

8.1 GLOSSARY OF FILE TYPES

FILE TYPE   : Domain classification file (for SCOP)
FORMAT      : DCF format (EMBL-like format for domain classification data).
DESCRIPTION : Classification and other data for domains from SCOP.
CREATED BY  : SCOPPARSE
SEE ALSO    : Domain sequence information can be added to the file by using DOMAINSEQS.

FILE TYPE   : Domain classification file (for CATH)
FORMAT      : DCF format (EMBL-like format for domain classification data).
DESCRIPTION : Classification and other data for domains from CATH.
CREATED BY  : CATHPARSE
SEE ALSO    : Domain sequence information can be added to the file by using DOMAINSEQS.

FILE TYPE   : Domain PDB file
FORMAT      : PDB format for domain coordinate data.
DESCRIPTION : Coordinate data for a single domain from SCOP or CATH.
CREATED BY  : DOMAINER
SEE ALSO    : N.A.

FILE TYPE   : Domain alignment file
FORMAT      : DAF format (clustal format with domain classification information).
DESCRIPTION : Contains a sequence alignment of domains belonging to the same SCOP or CATH family. The file is annotated with domain family classification information.
CREATED BY  : DOMAINALIGN (structure-based sequence alignment of domains of known structure).
SEE ALSO    : DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN.
None


9.0 DESCRIPTION

The generation of alignments for large datasets such as SCOP and CATH potentially requires a lot of time for preparation of datasets, writing of scripts, running individual jobs and so on, in addition to the compute time required for the alignments themselves. DOMAINALIGN automates this process: it reads a domain classification file and generates alignments for each user-specified node in turn.


10.0 ALGORITHM

More information on stamp can be found at http://www.compbio.dundee.ac.uk/manuals/stamp.4.2
More information on TCOFFEE can be found at http://www.ch.embnet.org/software/TCoffee.html


11.0 RELATED APPLICATIONS

See also

Program name  Description
contactcount  Counts specific versus non-specific contacts in a directory of cleaned protein chain contact files
contacts      Reads CCF files (clean coordinate files) and writes CON files (contact files) of intra-chain residue-residue contact data
domainrep     Reorder DCF file (domain classification file) so that the representative structure of each user-specified node is given first
domainreso    Removes low resolution domains from a DCF file (domain classification file)
interface     Reads CCF files (clean coordinate files) and writes CON files (contact files) of inter-chain residue-residue contact data
libgen        Generates various types of discriminating elements for each alignment in a directory
psiphi        Calculates phi and psi torsion angles from cleaned EMBOSS-style protein co-ordinate file
rocon         Reads a DHF file (domain hits file) of hits (sequences of unknown structural classification) and a DHF file of validation sequences (known classification) and writes a 'hits file' for the hits, which are classified and rank-ordered on the basis of score
rocplot       Provides interpretation and graphical display of the performance of discriminating elements (e.g. profiles for protein families). rocplot reads file(s) of hits from discriminator-database search(es), performs ROC analysis on the hits, and writes graphs illustrating the diagnostic performance of the discriminating elements
seqalign      Reads a DAF file (domain alignment file) and a DHF file (domain hits file) and writes a DAF file extended with the hits
seqfraggle    Removes fragments from DHF files (domain hits files) or other files of sequences
seqsearch     Generate database hits (sequences) for nodes in a DCF file (domain classification file) by using PSI-BLAST
seqsort       Reads DHF files (domain hits files) of database hits (sequences) and removes hits of ambiguous classification
seqwords      Generates DHF files (domain hits files) of database hits (sequences) for nodes in a DCF file (domain classification file) by keyword search of UniProt
siggen        Generates a sparse protein signature from an alignment and residue contact data
sigscan       Generates a DHF file (domain hits file) of hits (sequences) from scanning a signature against a sequence database



12.0 DIAGNOSTIC ERROR MESSAGES

The following message may appear in the log file.

Replaced ' ' in STAMP alignment with 'X'

STAMP can insert nonsensical whitespace into its alignments, e.g. in place of a residue character where that residue was missing electron density in the PDB file. DOMAINALIGN replaces each whitespace character within a STAMP alignment with an "X" and reports the replacement with this message.


13.0 AUTHORS

Ranjeeva Ranasinghe (rranasin@rfcgr.mrc.ac.uk)

Jon Ison (jison@rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK


14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references


Russell, R. B. & Barton, G. J. (1992), Multiple Sequence Alignment from Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels, PROTEINS: Struct. Funct. Genet., 14, 309-323.
C. Notredame, D. Higgins, J. Heringa. T-Coffee: A novel method for multiple sequence alignments. Journal of Molecular Biology, 302, 205-217, (2000)

More information on stamp can be found at http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/
More information on TCOFFEE can be found at http://www.ch.embnet.org/software/TCoffee.html