SCOPPARSE documentation


 

CONTENTS

1.0 SUMMARY
2.0 INPUTS & OUTPUTS
3.0 INPUT FILE FORMAT
4.0 OUTPUT FILE FORMAT
5.0 DATA FILES
6.0 USAGE
7.0 KNOWN BUGS & WARNINGS
8.0 NOTES
9.0 DESCRIPTION
10.0 ALGORITHM
11.0 RELATED APPLICATIONS
12.0 DIAGNOSTIC ERROR MESSAGES
13.0 AUTHORS
14.0 REFERENCES

1.0 SUMMARY

Reads raw SCOP classification files and writes a DCF file (domain classification file).

2.0 INPUTS & OUTPUTS

SCOPPARSE parses the dir.cla.scop.txt and dir.des.scop.txt SCOP classification files, which are available at, for example:
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57
http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57

The format of these files is explained at URL:
http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

SCOPPARSE writes the classification to a DCF file (EMBL-like format). No changes are made to the data other than changing the format in which it is held. The file does not include domain sequence information. The input and output files are specified by the user.

3.0 INPUT FILE FORMAT

Excerpts from the dir.cla.scop.txt (Figure 1) and dir.des.scop.txt (Figure 2) SCOP input files are shown below. The format of these files is explained on the SCOP website:
http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

Input files for usage example

File: scop.cla.raw

# dir.cla.scop.txt 
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
d1cs4a_	1cs4	A:	d.58.29.1	39418	cl=53931,cf=54861,sf=55073,fa=55074,dm=55077,sp=55078,px=39418
d1ii7a_	1ii7	A:	d.159.1.4	62415	cl=53931,cf=56299,sf=56300,fa=64427,dm=64428,sp=64429,px=62415

File: scop.des.raw

# dir.des.scop.txt 
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
53931	cl	d	-	Alpha and beta proteins (a+b)
54861	cf	d.58	-	Ferredoxin-like
55073	sf	d.58.29	-	Adenylyl and guanylyl cyclase catalytic domain
55074	fa	d.58.29.1	-	Adenylyl and guanylyl cyclase catalytic domain
55077	dm	d.58.29.1	-	Adenylyl cyclase VC1, domain C1a
55078	sp	d.58.29.1	-	Dog (Canis familiaris)
39418	px	d.58.29.1	d1cs4a_	1cs4 A:
56299	cf	d.159	-	Metallo-dependent phosphatases
56300	sf	d.159.1	-	Metallo-dependent phosphatases
64427	fa	d.159.1.4	-	DNA double-strand break repair nuclease
64428	dm	d.159.1.4	-	Mre11
64429	sp	d.159.1.4	-	Archaeon Pyrococcus furiosus
62415	px	d.159.1.4	d1ii7a_	1ii7 A:
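The layout of the two excerpts above can be illustrated with a short Python sketch. It is not part of SCOPPARSE. It assumes that data lines are tab-separated, as they appear in the excerpts, and that lines starting with '#' are comments. The function names and dictionary keys are hypothetical labels chosen for this illustration.

```python
def parse_cla_line(line):
    """Split one data line of dir.cla.scop.txt into a dictionary.

    Assumes six tab-separated fields, as in the excerpt above.
    """
    sid, pdb_id, region, sccs, sunid, nodes = line.rstrip("\n").split("\t")
    # The last field is a comma-separated list such as "cl=53931,cf=54861,..."
    hierarchy = {}
    for pair in nodes.split(","):
        key, value = pair.split("=")
        hierarchy[key] = int(value)
    return {"sid": sid, "pdb": pdb_id, "region": region,
            "sccs": sccs, "sunid": int(sunid), "hierarchy": hierarchy}


def parse_des_line(line):
    """Split one data line of dir.des.scop.txt into a dictionary.

    Assumes five tab-separated fields; the last one is free text.
    """
    sunid, level, sccs, sid, description = line.rstrip("\n").split("\t", 4)
    return {"sunid": int(sunid), "level": level, "sccs": sccs,
            "sid": sid, "description": description}


def data_lines(lines):
    """Yield only the data lines, skipping '#' comments and blank lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line
```

Joining the two files is then a matter of looking up each sunid from the hierarchy of a classification line in a dictionary built from the description lines.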

4.0 OUTPUT FILE FORMAT

An example of the DCF output file is shown in Figure 3. The records used to describe an entry are as follows. Records (4) to (9) are used to describe the position of the domain in the SCOP hierarchy.

(1) ID - Domain identifier code. This is a 7-character code that uniquely identifies the domain in SCOP. It is identical to the first 7 characters of a line in the SCOP classification file. The first character is always 'D', the next four characters are the PDB identifier code, the sixth character is the PDB chain identifier to which the domain belongs (a '.' is given in cases where the domain is composed of multiple chains, a '_' is given where a chain identifier was not specified in the PDB file) and the final character is the number of the domain in the chain (for chains comprising more than one domain) or '_' (the chain comprises a single domain only).
(2) EN - PDB identifier code. This is the 4-character PDB identifier code of the PDB entry containing the domain.
(3) TY - Domain type. "CATH" or "SCOP" is given ("SCOP" for DCF files generated by using SCOPPARSE).
(4) SI - SCOP sunids. The integers preceding the codes CL, FO, SF, FA, DO, SO and DD are the SCOP sunids for Class, Fold, Superfamily, Family, Domain, Source and domain data respectively. These numbers uniquely identify the appropriate node in the SCOP parsable files.
(5) CL - Domain class. It is identical to the text given after 'Class' in the SCOP classification file.
(6) FO - Domain fold. It is identical to the text given after 'Fold' in the SCOP classification file.
(7) SF - Domain superfamily. It is identical to the text given after 'Superfamily' in the SCOP classification file.
(8) FA - Domain family. It is identical to the text given after 'Family' in the SCOP classification file.
(9) DO - Domain name. It is identical to the text given after 'Protein' in the SCOP classification file.
(10) OS - Source of the protein. It is identical to the text given after 'Species' in the SCOP classification file.
(11) DS - Sequence of the domain according to the PDB file. This sequence is taken from the domain clean coordinate file generated by DOMAINER. The DS record will only be present if the DCF file has been processed using DOMAINSEQS.
(12) NC - Number of chains comprising the domain, or number of segments from the same chain that the domain is comprised of. NC is usually 1. If the number of chains is greater than 1, then the domain entry will have a section containing a CN and a CH record (see below) for each chain.
(13) CN - Chain number. The number given in brackets after this record indicates the start of the data for the relevant chain.
(14) CH - Domain definition. The character given before CHAIN is the PDB chain identifier (a '.' is given in cases where a chain identifier was not specified in the DCF file), the strings before START and END give the start and end positions respectively of the domain in the PDB file (a '.' is given in cases where a position was not specified). Note that the start and end positions refer to residue numbering given in the original PDB file and therefore must be treated as strings.
(15) XX - used for spacing.
(16) // - used to delimit records for a domain.
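As an illustration of record (1), the following Python sketch splits a 7-character domain identifier into its parts: 'D', then the four-character PDB code, then the chain character, then the domain character. The function name and the dictionary keys are hypothetical; they are not part of SCOPPARSE.

```python
def split_domain_id(domain_id):
    """Split a 7-character domain identifier such as 'D1CS4A_'.

    Returns the PDB code (characters 2-5), the chain character
    (character 6) and the domain character (character 7).
    """
    if len(domain_id) != 7 or domain_id[0].upper() != "D":
        raise ValueError("not a 7-character domain identifier: %r" % domain_id)
    return {
        "pdb": domain_id[1:5],
        # '.' means several chains, '_' means no chain identifier was given
        "chain": domain_id[5],
        # '_' means the chain comprises a single domain only
        "domain": domain_id[6],
    }
```

For the first entry of the usage example, split_domain_id("D1CS4A_") gives PDB code 1CS4, chain A and domain character '_'.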

Output files for usage example

File: all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
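The output shown above can be read back with a small Python sketch. It assumes only what the example shows: each record starts with a two-character code, 'XX' lines are spacers and '//' ends an entry. Lines that start with white space are treated as continuations of the previous record, which is an assumption made here for multi-line records such as sequences. The function name parse_dcf is hypothetical.

```python
def parse_dcf(text):
    """Parse DCF text into a list of entries.

    Each entry is a dictionary that maps a record code (ID, EN, ...)
    to the list of values seen for that code, in file order.
    """
    entries = []
    current = {}
    last_code = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("XX"):
            continue  # blank lines and XX spacers carry no data
        if line.startswith("//"):
            if current:
                entries.append(current)
            current, last_code = {}, None
        elif line[0].isspace():
            # Continuation line: append to the most recent record
            if last_code is not None:
                current[last_code][-1] += " " + line.strip()
        else:
            code, value = line[:2], line[2:].strip()
            current.setdefault(code, []).append(value)
            last_code = code
    if current:
        entries.append(current)  # tolerate a missing final '//'
    return entries
```

Values are kept as strings. This matters for the START and END positions in CH records, which must be treated as strings.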

5.0 DATA FILES

No data files are used.

6.0 USAGE

6.1 COMMAND LINE ARGUMENTS

   Standard (Mandatory) qualifiers:
  [-classfile]         infile     This option specifies the name of raw SCOP
                                  classification file dir.cla.scop.txt_X.XX
                                  (input). This is the raw SCOP classification
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
  [-desfile]           infile     This option specifies the name of raw SCOP
                                  description file dir.des.scop.txt_X.XX
                                  (input). This is the raw SCOP description
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
   -nosegments         boolean    This option specifies whether to omit
                                  domains comprising of more than one segment.
                                  This is necessary if a continuous residue
                                  sequence is required.
   -nomultichain       boolean    This option specifies whether to omit
                                  domains comprising segments from more than
                                  one chain. This is necessary if a continuous
                                  residue sequence is required.
  [-dcffile]           outfile    This option specifies the name of SCOP DCF
                                  file (domain classification file) (output).
                                  A 'domain classification file' contains
                                  classification and other data for domains
                                  from the SCOP or CATH databases. The file is
                                  generated by using DOMAINER and is in DCF
                                  format (EMBL-like). Domain sequence
                                  information can be added to the file by
                                  using DOMAINSEQS.

   Additional (Optional) qualifiers:
   -nominor            boolean    This option specifies whether to omit
                                  domains from minor classes (defined as
                                  anything not in class 'All alpha proteins',
                                  'All beta proteins', 'Alpha and beta
                                  proteins (a/b)' or 'Alpha and beta proteins
                                  (a+b)'). This is necessary or appropriate
                                  for many analyses.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcffile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


Standard (Mandatory) qualifiers

[-classfile] (Parameter 1)
    This option specifies the name of raw SCOP classification file dir.cla.scop.txt_X.XX (input). This is the raw SCOP classification file available at http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
    Allowed values: Input file
    Default: Required

[-desfile] (Parameter 2)
    This option specifies the name of raw SCOP description file dir.des.scop.txt_X.XX (input). This is the raw SCOP description file available at http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
    Allowed values: Input file
    Default: Required

-nosegments
    This option specifies whether to omit domains comprising of more than one segment. This is necessary if a continuous residue sequence is required.
    Allowed values: Boolean value Yes/No
    Default: No

-nomultichain
    This option specifies whether to omit domains comprising segments from more than one chain. This is necessary if a continuous residue sequence is required.
    Allowed values: Boolean value Yes/No
    Default: No

[-dcffile] (Parameter 3)
    This option specifies the name of SCOP DCF file (domain classification file) (output). A 'domain classification file' contains classification and other data for domains from the SCOP or CATH databases. The file is generated by using DOMAINER and is in DCF format (EMBL-like). Domain sequence information can be added to the file by using DOMAINSEQS.
    Allowed values: Output file
    Default: test.scop

Additional (Optional) qualifiers

-nominor
    This option specifies whether to omit domains from minor classes (defined as anything not in class 'All alpha proteins', 'All beta proteins', 'Alpha and beta proteins (a/b)' or 'Alpha and beta proteins (a+b)'). This is necessary or appropriate for many analyses.
    Allowed values: Boolean value Yes/No
    Default: No

Advanced (Unprompted) qualifiers

(none)

6.2 EXAMPLE SESSION


An example of interactive use of SCOPPARSE is shown below.


% scopparse 
Reads raw SCOP classification files and writes a DCF file (domain
classification file).
Name of raw SCOP classification file dir.cla.scop.txt_X.XX (input).: scop.cla.raw
Name of raw SCOP description file dir.des.scop.txt_X.XX (input).: scop.des.raw
Omit domains comprising of more than one segment. [N]: Y
Omit domains comprising segments from more than one chain. [N]: N
Name of SCOP DCF file (domain classification file) (output). [test.scop]: all.scop




The raw SCOP classification files /test_data/scop.cla.raw and /test_data/scop.des.raw were read and a domain classification file in DCF (EMBL-like) format called /test_data/scopparse/all.scop was written. Because -nosegments was set to Y, the output file does not contain domains that comprise more than one segment, whether the segments come from one chain or from more than one chain.
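The same run can in principle be made without prompts by giving the qualifiers from section 6.1 on the command line. The Python sketch below only builds the argument list; the helper names are hypothetical. It assumes that a bare boolean qualifier (for example -nosegments) switches that option on, which is the usual EMBOSS convention but should be checked against the installed version, and it adds -auto to turn off prompts.

```python
import subprocess


def scopparse_command(classfile, desfile, dcffile,
                      nosegments=False, nomultichain=False, nominor=False):
    """Build a scopparse argument list from the documented qualifiers."""
    cmd = ["scopparse",
           "-classfile", classfile,
           "-desfile", desfile,
           "-dcffile", dcffile]
    # Assumption: a bare boolean qualifier sets that option to true
    if nosegments:
        cmd.append("-nosegments")
    if nomultichain:
        cmd.append("-nomultichain")
    if nominor:
        cmd.append("-nominor")
    cmd.append("-auto")  # turn off prompts; unset options take their defaults
    return cmd


def run_scopparse(cmd):
    """Run the command; requires scopparse to be installed and on the PATH."""
    subprocess.run(cmd, check=True)
```

For the session above, scopparse_command("scop.cla.raw", "scop.des.raw", "all.scop", nosegments=True) yields the arguments for a command line equivalent to the interactive answers.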

7.0 KNOWN BUGS & WARNINGS

None.

8.0 NOTES

Some SCOP domains comprise more than one segment of polypeptide chain, and these segments may belong to a single polypeptide chain or to more than one. It is debatable whether a domain (using the widely accepted definition) can truly consist of regions from more than one polypeptide. Accordingly, SCOPPARSE gives the option of omitting from the output file domains that consist of more than one segment, and domains whose segments come from different chains.
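The two filters can be sketched in Python as follows. This is an illustration only and not necessarily how SCOPPARSE implements them. It assumes that the third field of a dir.cla.scop.txt line lists the segments separated by commas, each optionally prefixed by a chain identifier and a colon (for example 'A:' as in the excerpt in section 3.0), and that '-' stands for a whole entry with no chain identifier. Only the single-segment form 'A:' appears in this document; the multi-segment forms are an assumption about the SCOP format. The function names are hypothetical.

```python
def summarise_region(region):
    """Return (number of segments, set of chain identifiers) for a region."""
    if region in ("", "-"):
        return 1, set()  # whole entry, no chain identifier given
    segments = region.split(",")
    chains = {seg.split(":", 1)[0] for seg in segments if ":" in seg}
    return len(segments), chains


def keep_domain(region, nosegments=False, nomultichain=False):
    """Decide whether a domain survives the two filters."""
    n_segments, chains = summarise_region(region)
    if nosegments and n_segments > 1:
        return False  # more than one segment
    if nomultichain and len(chains) > 1:
        return False  # segments from more than one chain
    return True
```

Note that under these assumptions a multi-chain domain always has more than one segment, so -nosegments alone also removes multi-chain domains.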

SCOP includes several minor classes which are not appropriate for some analyses. Accordingly, SCOPPARSE gives the option to omit domains from minor classes. A minor class is defined as anything not in class 'All alpha proteins', 'All beta proteins', 'Alpha and beta proteins (a/b)' or 'Alpha and beta proteins (a+b)'.
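The definition of a minor class can be written as a simple membership test. The Python sketch below uses the four class names quoted above; the names MAJOR_CLASSES and is_minor_class are hypothetical and the comparison is an exact string match, which is an assumption about how the class text is compared.

```python
# The four major classes named in this document
MAJOR_CLASSES = frozenset({
    "All alpha proteins",
    "All beta proteins",
    "Alpha and beta proteins (a/b)",
    "Alpha and beta proteins (a+b)",
})


def is_minor_class(class_name):
    """Return True if the class is not one of the four major classes."""
    return class_name.strip() not in MAJOR_CLASSES
```

Both entries in the usage example belong to 'Alpha and beta proteins (a+b)', so neither would be removed by this test.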

8.1 GLOSSARY OF FILE TYPES

FILE TYPE:    Domain classification file (for SCOP)
FORMAT:       DCF format (EMBL-like format for domain classification data).
DESCRIPTION:  Classification and other data for domains from SCOP. The file is in DCF format (EMBL-like).
CREATED BY:   SCOPPARSE
SEE ALSO:     Domain sequence information can be added to the file by using DOMAINSEQS.

8.3 DEPRECATED RECORDS

The following records for database sequence are no longer used in a DCF file.
(1) AC - Accession number of the domain sequence. This record will only be present if the DCF file has been processed using DOMAINSEQS and if an accession number for the PDB file corresponding to the domain is given in the swissprot:PDB equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use of.
(2) SP - Swissprot code of the domain sequence. This record will only be present if the domain classification file has been processed using DOMAINSEQS and if a swissprot code for the PDB file corresponding to the domain is given in the swissprot:PDB equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use of.
(3) RA - Position of domain in swissprot sequence. The integers preceding START and END give the start and end points respectively of the domain sequence relative to the full-length swissprot sequence.
(4) SQ - Sequence of the domain according to swissprot. This sequence is taken from the swissprot database. The SQ record will only be present if the SCOP classification file has been processed using DOMAINSEQS and if an accession number for the PDB file corresponding to the domain is given in the swissprot:PDB equivalence file (generated by PDBTOSP) that DOMAINSEQS makes use of.
XX
AC   P02213
XX
SP   GLB1_SCAIN
XX
RA   1 START; 146 END;
XX
SQ   SEQUENCE   146 AA;  15947 MW;  5868B4E5 CRC32;
     PSVYDAAAQL TADVKKDLRD SWKVIGSDKK GNGVALMTTL FADNQETIGY FKRLGDVSQG
     MANDKLRGHS ITLMYALQNF IDQLDNPDDL VCVVEKFAVN HITRKISAAE FGKINGPIKK
     VLASKNFGDK YANAWAKLVA VVQAAL
XX

9.0 DESCRIPTION

The raw SCOP classification files are inconvenient for some uses because the text describing the domain classification is given in a different file to the classification itself, the file formats are not easily extended and differ from other related classifications such as CATH. SCOPPARSE reads the raw SCOP classification files and writes a single file in DCF (EMBL-like) format, which is an easier format to work with, is more human-readable and is more extensible than the native SCOP database format.

10.0 ALGORITHM

None.

11.0 RELATED APPLICATIONS

See also

Program name    Description
aaindexextract  Extract data from AAINDEX
allversusall    Does an all-versus-all global alignment for each set of sequences in an input directory and writes files of sequence similarity values
cathparse       Reads raw CATH classification files and writes DCF file (domain classification file)
cutgextract     Extract data from CUTG
domainer        Reads CCF files (clean coordinate files) for proteins and writes CCF files for domains, taken from a DCF file (domain classification file)
domainnr        Removes redundant domains from a DCF file (domain classification file). The file must contain domain sequence information, which can be added by using DOMAINSEQS
domainseqs      Adds sequence records to a DCF file (domain classification file)
domainsse       Adds secondary structure records to a DCF file (domain classification file)
hetparse        Converts raw dictionary of heterogen groups to a file in EMBL-like format
pdbparse        Parses PDB files and writes CCF files (clean coordinate files) for proteins
pdbplus         Add residue solvent accessibility and secondary structure data to a CCF file (clean coordinate file) for a protein or domain
pdbtosp         Convert raw swissprot:PDB equivalence file to EMBL-like format
printsextract   Extract data from PRINTS
prosextract     Builds the PROSITE motif database for patmatmotifs to search
rebaseextract   Extract data from REBASE
seqnr           Removes redundancy from DHF files (domain hits files) or other files of sequences
sites           Reads CCF files (clean coordinate files) and writes CON files (contact files) of residue-ligand contact data for domains in a DCF file (domain classification file)
ssematch        Searches a DCF file (domain classification file) for secondary structure matches
tfextract       Extract data from TRANSFAC

SEQNR and SEQALIGN use a domain classification file as input. SEQNR and SEQALIGN require the file to contain domain sequence information, which can be added by using DOMAINSEQS.

12.0 DIAGNOSTIC ERROR MESSAGES

None.

13.0 AUTHORS

Alan Bleasby (ableasby@hgmp.mrc.ac.uk)

Jon Ison (jison@rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

14.0 REFERENCES

Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16(6):276-277.

See also http://emboss.sourceforge.net/

14.1 Other useful references

1. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G. and Chothia, C. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257-259.