SIGGEN documentation
TY SCOP XX CL Alpha and beta proteins (a+b) XX FO Ferredoxin-like XX SF Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain XX FA Aspartate carbamoyltransferase, Regulatory-chain, N-terminal domain XX SI 54894 XX NP 14 XX NN [1] XX IN NRES 1 ; NGAP 2 ; WSIZ 0 XX AA G ; 5 XX GA 6 ; 3 GA 7 ; 2 XX NN [2] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA D ; 5 XX GA 3 ; 5 XX NN [3] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA H ; 5 XX GA 0 ; 5 XX NN [4] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA P ; 5 XX GA 1 ; 5 XX NN [5] XX IN NRES 1 ; NGAP 3 ; WSIZ 0 [Part of this file has been deleted for brevity] XX GA 4 ; 5 XX NN [9] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA P ; 5 XX GA 16 ; 5 XX NN [10] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA T ; 5 XX GA 2 ; 5 XX NN [11] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA N ; 5 XX GA 1 ; 5 XX NN [12] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA Y ; 5 XX GA 4 ; 5 XX NN [13] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA K ; 5 XX GA 4 ; 5 XX NN [14] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA P ; 5 XX GA 5 ; 5 // |
TY SCOP XX CL Alpha and beta proteins (a+b) XX FO Ferredoxin-like XX SF Adenylyl and guanylyl cyclase catalytic domain XX FA Adenylyl and guanylyl cyclase catalytic domain XX SI 55074 XX NP 30 XX NN [1] XX IN NRES 1 ; NGAP 4 ; WSIZ 0 XX AA F ; 6 XX GA 9 ; 2 GA 11 ; 1 GA 17 ; 1 GA 19 ; 2 XX NN [2] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA D ; 5 AA S ; 1 XX GA 1 ; 6 XX NN [3] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA I ; 6 XX GA 0 ; 6 XX NN [4] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA F ; 5 AA S ; 1 XX GA 2 ; 6 [Part of this file has been deleted for brevity] IN NRES 3 ; NGAP 1 ; WSIZ 0 XX AA W ; 3 AA Y ; 1 AA F ; 2 XX GA 2 ; 6 XX NN [26] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA A ; 6 XX GA 6 ; 6 XX NN [27] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA E ; 5 AA D ; 1 XX GA 3 ; 6 XX NN [28] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA I ; 4 AA V ; 2 XX GA 7 ; 6 XX NN [29] XX IN NRES 2 ; NGAP 1 ; WSIZ 0 XX AA T ; 5 AA A ; 1 XX GA 5 ; 6 XX NN [30] XX IN NRES 1 ; NGAP 1 ; WSIZ 0 XX AA L ; 6 XX GA 3 ; 6 // |
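The two signature files above are shown flattened onto single lines. As a rough illustration of how the records fit together, the following Python sketch parses a signature laid out with one record per line. It is an assumption-laden sketch, not SIGGEN or SIGSCAN code: the one-record-per-line layout, the reading of `AA` as "residue ; count" and `GA` as "gap length ; count", and the names `SignaturePosition` and `parse_signature` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SignaturePosition:
    number: int                                    # value of the NN record
    window: int = 0                                # WSIZ value from the IN record
    residues: dict = field(default_factory=dict)   # residue -> count (AA records)
    gaps: dict = field(default_factory=dict)       # gap length -> count (GA records)

def parse_signature(text):
    """Parse one signature entry; assumes one record per line (hypothetical layout)."""
    header, positions, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":               # XX is a spacer record
            continue
        if line == "//":                           # end of entry
            break
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "NN":
            current = SignaturePosition(number=int(value.strip("[]")))
            positions.append(current)
        elif code == "IN" and current is not None:
            fields = dict(item.split() for item in value.split(";"))
            current.window = int(fields["WSIZ"])
        elif code == "AA" and current is not None:
            residue, count = (part.strip() for part in value.split(";"))
            current.residues[residue] = int(count)
        elif code == "GA" and current is not None:
            gap, count = (part.strip() for part in value.split(";"))
            current.gaps[int(gap)] = int(count)
        else:
            header[code] = value                   # TY, CL, FO, SF, FA, SI, NP
    return header, positions

# The first two positions of the first example above, one record per line.
SAMPLE = """TY   SCOP
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SI   54894
XX
NP   14
XX
NN   [1]
XX
IN   NRES 1 ; NGAP 2 ; WSIZ 0
XX
AA   G ; 5
XX
GA   6 ; 3
GA   7 ; 2
XX
NN   [2]
XX
IN   NRES 1 ; NGAP 1 ; WSIZ 0
XX
AA   D ; 5
XX
GA   3 ; 5
//
"""

header, positions = parse_signature(SAMPLE)
print(header["SI"], len(positions), positions[0].gaps)
```

Keeping the header records in a plain dictionary and the per-position data in small objects mirrors the way each `NN` record opens a new block in the file.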
Standard (Mandatory) qualifiers (* if not always prompted):

[-algpath] dirlist This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN.

-mode menu This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment.

* -conoption menu This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment.

* -conpath directory This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and FUNKY.

* -cpdbpath directory This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS.

* -datafile matrixf This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes.

* -sparsity integer This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, then a signature of sparsity 10% (default value) will contain 25 key residues or signature positions, which correspond to the 25 highest scoring alignment positions.

-wsiz integer This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a window size that is used for every position in the signature. This is unlikely to be optimal. A future implementation will provide a range of methods for generating values of window size depending upon the alignment (window size is identified by the WSIZ record in the signature output file).

* -seqoption menu This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment.

* -filtercon toggle This option specifies whether to disregard positions that form only a few contacts when selecting signature positions.

* -conthresh integer This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts).

* -[no]filterpsim boolean This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only).

[-sigoutdir] outdir This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN program. The files are generated by using SIGGEN.

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers: (none)

Associated qualifiers: (none)

General qualifiers:

-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More information on associated and general qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report deaths
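Mode 1 of `-mode` selects only the alignment sites flagged with a '1' in the 'Positions' line. The sketch below illustrates that selection on a toy alignment block. The exact DAF layout is not reproduced here: the sketch assumes the 'Positions' flags and the aligned sequences have already been extracted as equal-length strings, and the function name `marked_columns` and the toy sequences are hypothetical.

```python
def marked_columns(positions_line, aligned_rows):
    """Return {column index: residues in that column} for columns flagged '1'.

    positions_line: string of flags, one per alignment column (assumed layout).
    aligned_rows:   aligned sequences, each the same length as positions_line.
    """
    selected = [i for i, flag in enumerate(positions_line) if flag == "1"]
    return {i: [row[i] for row in aligned_rows] for i in selected}

# Toy data (not taken from a real DAF file).
positions = "0100100"
rows = ["GDHP-TN",
        "GDHPKTN",
        "GEHP-SN"]

columns = marked_columns(positions, rows)
print(columns)
```

Only columns 1 and 4 are flagged in this toy case, so the signature would draw its residue data from those two sites and ignore the rest.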
Standard (Mandatory) qualifiers | Description | Allowed values | Default |
---|---|---|---|
[-algpath] (Parameter 1) | This option specifies the location of DAF files (domain alignment files) (input). A 'domain alignment file' contains a sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is in DAF format (CLUSTAL-like) and is annotated with domain family classification information. The files generated by using SCOPALIGN will contain a structure-based sequence alignment of domains of known structure only. Such alignments can be extended with sequence relatives (of unknown structure) by using SEQALIGN. | Directory with files | ./ |
-mode | This option specifies the mode of signature generation. There are 3 modes of signature generation: (1) Use positions specified in alignment file. The alignment file must contain a line beginning with the text 'Positions' for each line of the alignment. A '1' in the 'Positions' line indicates that the signature should include data from the corresponding alignment site. The signature will only include the positions that are marked with a '1'. (2) Use a scoring method. The alignment is scored (see 'Algorithm') and the signature of a specified sparsity is sampled from high scoring positions. (3) Generate a randomised signature. A signature of a specified sparsity is sampled at random from the alignment. | 1 (Use positions specified in alignment file); 2 (Use a scoring method); 3 (Generate a randomised signature) | 1 |
-conoption | This option specifies the structure-based scoring scheme. SIGGEN provides 2 structure-based scoring schemes (plus a combination method) that are used to score the input alignment. | 1 (Number); 2 (Conservation); 3 (Number and conservation); 4 (None, structural data available); 5 (None, no structural data available) | 5 |
-conpath | This option specifies the location of CON files (contact files) (input). A 'contact file' contains contact data for a protein or a domain from SCOP or CATH, in the CON format (EMBL-like). The contacts may be intra-chain residue-residue, inter-chain residue-residue or residue-ligand. The files are generated by using CONTACTS, INTERFACE and FUNKY. | Directory | ./ |
-cpdbpath | This option specifies the location of domain CCF files (clean coordinate files) (input). A 'clean coordinate file' contains protein coordinate and derived data for a single PDB file ('protein clean coordinate file') or a single domain from SCOP or CATH ('domain clean coordinate file'), in CCF format (EMBL-like). The files, generated by using PDBPARSE (PDB files) or DOMAINER (domains), contain 'cleaned-up' data that is self-consistent and error-corrected. Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. | Directory | ./ |
-datafile | This option specifies the substitution matrix. The substitution matrix is used by the sequence-based scoring schemes. | Comparison matrix file in EMBOSS data path | EBLOSUM62 |
-sparsity | This option specifies the % sparsity of the signature. The signature sparsity is a user-defined parameter that determines how many residues the final signature will contain. For example, if the average sequence length of the proteins in the alignment is 250 residues, then a signature of sparsity 10% (default value) will contain 25 key residues or signature positions, which correspond to the 25 highest scoring alignment positions. | Any integer value | 10 |
-wsiz | This option specifies the window size. When a signature is aligned to a protein sequence, the permissible gap between two signature positions is determined by the empirical gaps and the window size. The user is prompted for a window size that is used for every position in the signature. This is unlikely to be optimal. A future implementation will provide a range of methods for generating values of window size depending upon the alignment (window size is identified by the WSIZ record in the signature output file). | Any integer value | 0 |
-seqoption | This option specifies the sequence-based scoring scheme. SIGGEN provides 2 sequence-based scoring schemes that are used to score the input alignment. | 1 (Substitution matrix); 2 (Residue class); 3 (None) | 3 |
-filtercon | This option specifies whether to disregard positions that form only a few contacts when selecting signature positions. | Toggle value Yes/No | No |
-conthresh | This option specifies the threshold contact number. This controls the selection of key positions for the structure-based scoring scheme (number of contacts). | Any integer value | 10 |
-[no]filterpsim | This option specifies whether to disregard alignment sites that were not aligned satisfactorily (STAMP alignments only). | Boolean value Yes/No | Yes |
[-sigoutdir] (Parameter 2) | This option specifies the location of signature files (output). A 'signature file' contains a sparse sequence signature suitable for use with the SIGSCAN program. The files are generated by using SIGGEN. | Output directory | ./ |
Additional (Optional) qualifiers | | Allowed values | Default |
(none) | | | |
Advanced (Unprompted) qualifiers | | Allowed values | Default |
(none) | | | |
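The `-sparsity` description amounts to a small calculation: the number of signature positions is the sparsity percentage applied to the average sequence length, and in mode 2 those positions are drawn from the highest-scoring alignment sites. The Python sketch below works through the 250-residue example. It is illustrative only: the rounding rule, the tie handling and the function names `signature_size` and `top_positions` are assumptions, and SIGGEN's own scoring is not reproduced.

```python
def signature_size(mean_length, sparsity_percent):
    """Number of signature positions for a given % sparsity (rounding is assumed)."""
    return round(mean_length * sparsity_percent / 100)

def top_positions(scores, n):
    """Indices of the n highest-scoring alignment positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

# Worked example from the -sparsity description: 250 residues at 10% sparsity.
n_positions = signature_size(250, 10)
print(n_positions)

# Toy per-position scores (hypothetical values).
print(top_positions([0.1, 0.9, 0.5, 0.7], 2))
```

Returning the chosen indices in sequence order keeps the positions in the order they would appear in the signature, which is what the gap records between consecutive positions rely on.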
    % siggen
    Generates a sparse protein signature from an alignment and residue contact data.
    Location of DAF files (domain alignment files) (input) [./]:
    Specify mode of signature generation
             1 : Use positions specified in alignment file
             2 : Use a scoring method
             3 : Generate a randomised signature
    Select number [1]: 2
    The % sparsity of signature [10]: 15
    Window size [0]: 0
    Sequence variability scoring method
             1 : Substitution matrix
             2 : Residue class
             3 : None
    Select number [3]: 1
    Substitution matrix to be used [EBLOSUM62]: EBLOSUM62
    Residue contacts scoring method
             1 : Number
             2 : Conservation
             3 : Number and conservation
             4 : None (structural data available)
             5 : None (no structural data available)
    Select number [5]: 5
    Ignore alignment postitions with post_similar value of 0 [Y]: Y
    Location of signature files (output) [./]:
A sparse protein signature was generated for each DAF file in the
directory test_data/ (the '.daf' file extension is specified in the ACD
file). Mode 2 was used, so the signatures were sampled from high scoring
positions in the alignments, scored on the basis of residue variability
calculated by using the EBLOSUM62 residue substitution matrix.
Signatures of 15% sparsity were generated, and the default window size
(0) was used. Alignment positions with a post_similar value of 0 were
ignored, i.e. not sampled in the signature. No structural data were
available for the domains in the alignments, and this was specified as
option 5 at the "Residue contacts scoring method" prompt. Signature
files (the '.sig' file extension is specified in the ACD file) were
written to test_data/siggen/.
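The same run could in principle be scripted rather than answered interactively. The Python sketch below only assembles and prints a command line from the qualifiers documented above, using the values given in the example session and the directories named in the description. It has not been run against SIGGEN: passing menu choices as numbers, giving `-filterpsim` as a bare flag, and adding `-auto` to suppress prompts are assumptions based on the qualifier tables, and the function name `siggen_command` is hypothetical.

```python
import shlex

def siggen_command(algpath, sigoutdir, sparsity=15, wsiz=0):
    """Build an argument list mirroring the interactive example (untested sketch)."""
    return [
        "siggen",
        "-algpath", algpath,          # directory of DAF files
        "-mode", "2",                 # 2 : Use a scoring method
        "-sparsity", str(sparsity),   # % sparsity of signature
        "-wsiz", str(wsiz),           # window size
        "-seqoption", "1",            # 1 : Substitution matrix
        "-datafile", "EBLOSUM62",     # substitution matrix
        "-conoption", "5",            # 5 : None (no structural data available)
        "-filterpsim",                # ignore sites with post_similar value of 0
        "-sigoutdir", sigoutdir,      # directory for signature files
        "-auto",                      # general qualifier: turn off prompts
    ]

cmd = siggen_command("test_data/", "test_data/siggen/")
print(shlex.join(cmd))

# To actually run it (requires SIGGEN to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a directory name contains spaces.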
FILE TYPE | FORMAT | DESCRIPTION | CREATED BY | SEE ALSO |
---|---|---|---|---|
Clean coordinate file (for domain) | CCF format (EMBL-like). | Protein coordinate and derived data for a single domain from SCOP or CATH. The data are 'cleaned-up': self-consistent and error-corrected. | DOMAINER | Records for residue solvent accessibility and secondary structure are added to the file by using PDBPLUS. |
Contact file (intra-chain residue-residue contacts) | CON format (EMBL-like). | Intra-chain residue-residue contact data for a protein or a domain from SCOP or CATH. | CONTACTS | N.A. |
Domain alignment file | DAF format (CLUSTAL-like). | Sequence alignment of domains belonging to the same SCOP or CATH family (or other node in the structural hierarchies). The file is annotated with domain family classification information. | DOMAINALIGN (structure-based sequence alignment of domains of known structure). | DOMAINALIGN alignments can be extended with sequence relatives (of unknown structure) to the family in question by using SEQALIGN. |
Signature file | SIG format | Contains a sparse sequence signature suitable for use with the SIGSCAN program. | SIGGEN, LIBGEN | The files are generated by using SIGGEN. |
Program name | Description |
---|---|
contactcount | Counts specific versus non-specific contacts in a directory of cleaned protein chain contact files |
contacts | Reads CCF files (clean coordinate files) and writes CON files (contact files) of intra-chain residue-residue contact data |
domainalign | Generates DAF files (domain alignment files) of structure-based sequence alignments for nodes in a DCF file (domain classification file) |
domainrep | Reorders a DCF file (domain classification file) so that the representative structure of each user-specified node is given first |
domainreso | Removes low resolution domains from a DCF file (domain classification file) |
interface | Reads CCF files (clean coordinate files) and writes CON files (contact files) of inter-chain residue-residue contact data |
libgen | Generates various types of discriminating elements for each alignment in a directory |
psiphi | Calculates phi and psi torsion angles from cleaned EMBOSS-style protein co-ordinate file |
rocon | Reads a DHF file (domain hits file) of hits (sequences of unknown structural classification) and a DHF file of validation sequences (known classification) and writes a 'hits file' for the hits, which are classified and rank-ordered on the basis of score |
rocplot | Provides interpretation and graphical display of the performance of discriminating elements (e.g. profiles for protein families). rocplot reads file(s) of hits from discriminator-database search(es), performs ROC analysis on the hits, and writes graphs illustrating the diagnostic performance of the discriminating elements |
seqalign | Reads a DAF file (domain alignment file) and a DHF file (domain hits file) and writes a DAF file extended with the hits |
seqfraggle | Removes fragments from DHF files (domain hits files) or other files of sequences |
seqsearch | Generates database hits (sequences) for nodes in a DCF file (domain classification file) by using PSI-BLAST |
seqsort | Reads DHF files (domain hits files) of database hits (sequences) and removes hits of ambiguous classification |
seqwords | Generates DHF files (domain hits files) of database hits (sequences) for nodes in a DCF file (domain classification file) by keyword search of UniProt |
sigscan | Generates a DHF file (domain hits file) of hits (sequences) from scanning a signature against a sequence database |
See also http://emboss.sourceforge.net/
Automatic generation and evaluation of sparse protein signatures for families of protein
structural domains. MJ Blades, JC Ison, R Ranasinghe, and JBC Findlay. Protein Science. 2005 (accepted)
A key residues approach to the definition of protein families and analysis
of sparse family signatures. JC Ison, AJ Bleasby, MJ Blades, SC Daniel,
JH Parish, JBC Findlay. PROTEINS: Structure, Function & Genetics. 2000,
40:330-341
Alignment of a sparse protein signature with protein sequences: application
to fold prediction for three small globulins. SC Daniel, JH Parish,
JC Ison, MJ Blades & JBC Findlay. FEBS Letters. 1999, 459:349-352.