This directory contains files of the ProtoMap classification
(release 3.0)
######################################################
List of files
######################################################
May 26 2000 - 1e-0.class.gz
Description: Classification at the lowest level of the
hierarchy (level 1e-0). Proteins are sorted by length.
Each line is of the form:
is a list that may contain several
families or may be empty
Cluster numbers are positive, except for -1, -2 and -3, which
stand for "Envelope polyproteins", "Cytochrome b" and
"Ribulose large chain" respectively.
Singletons were not classified in the first round, but most
of them (excluding very short sequences) are classified
as satellite members of other clusters. This list is
available upon request.
######################################################
July 31 2000 - num_homologs.swiss38.gz
Description: Number of similar sequences found in
Gapped-Blast search (with the BLOSUM62 scoring matrix and
gap penalties of 11 for opening and 1 for each extension).
In this file only SwissProt38 entries are listed, and the
statistics are based only on SwissProt38 entries.
The statistics for all sequences in the NR database are
given in the file num_homologs.nr.gz.
Proteins are sorted by length. Each line is of the form:
aa <#low-complexity amino acids> lc
<#low-complexity amino acids> is the number of amino acids
that are marked as part of low-complexity segments according
to the program SEG (if most of the sequence is low-complexity,
expect an excess of hits).
is a list of pairs <#hits>
For example: -49 1 -8 1 -4 2 -3 3 -1 5 0 10
I.e. there is one hit with 1e-50 < evalue <= 1e-49
there is one hit with 1e-9 < evalue <= 1e-8
there are two hits with 1e-5 < evalue <= 1e-4
there are three hits with 1e-4 < evalue <= 1e-3
there are five hits with 1e-2 < evalue <= 1e-1
there are ten hits with 1e-1 < evalue <= 1e-0
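
The example above suggests that each pair is an e-value exponent followed
by a hit count, where exponent k covers the bin 1e(k-1) < evalue <= 1e(k).
The following is a minimal Python sketch under that assumption. The
function name parse_hit_pairs is hypothetical, and the sketch handles only
the pair list, not the leading fields of the line.

```python
def parse_hit_pairs(text):
    """Parse a flat '<exponent> <#hits>' list into (low, high, hits) bins.

    Assumption (inferred from the README example): exponent k stands
    for the bin 1e(k-1) < evalue <= 1e(k).
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of values")
    bins = []
    for i in range(0, len(tokens), 2):
        exponent = int(tokens[i])
        hits = int(tokens[i + 1])
        bins.append((exponent - 1, exponent, hits))
    return bins


# The example pair list quoted in this README.
example = "-49 1 -8 1 -4 2 -3 3 -1 5 0 10"
bins = parse_hit_pairs(example)
for low, high, hits in bins:
    print(f"{hits} hit(s) with 1e{low} < evalue <= 1e{high}")
```

Keeping both bin edges in each tuple makes the half-open interval explicit
when the counts are later plotted or filtered.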
######################################################
July 31 2000 - num_homologs.nr.gz
Description: Number of similar sequences found in
Gapped-Blast search (with the BLOSUM62 scoring matrix and
gap penalties of 11 for opening and 1 for each extension).
These are the statistics for all sequences in the NR database.
Proteins are sorted by length. Each line is of the form:
aa <#low-complexity amino acids> lc
<#low-complexity amino acids> is the number of amino acids
that are marked as part of low-complexity segments according
to the program SEG (if most of the sequence is low-complexity,
expect an excess of hits).
is a list of pairs <#hits>
For example: -49 1 -8 1 -4 2 -3 3 -1 5 0 10
I.e. there is one hit with 1e-50 < evalue <= 1e-49
there is one hit with 1e-9 < evalue <= 1e-8
there are two hits with 1e-5 < evalue <= 1e-4
there are three hits with 1e-4 < evalue <= 1e-3
there are five hits with 1e-2 < evalue <= 1e-1
there are ten hits with 1e-1 < evalue <= 1e-0
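
A common use of these counts is to ask how many hits fall at or below a
given e-value threshold. The sketch below sums the counts of all bins whose
upper bound is at most 1e(max_exponent). It assumes the pair layout implied
by the example (exponent, then hit count); the function name
hits_at_or_below is hypothetical.

```python
def hits_at_or_below(pairs_text, max_exponent):
    """Count hits with evalue <= 1e(max_exponent).

    Assumption: pairs_text is a flat '<exponent> <#hits>' list in which
    exponent k marks the bin 1e(k-1) < evalue <= 1e(k).
    """
    tokens = [int(t) for t in pairs_text.split()]
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of values")
    total = 0
    # Even positions hold exponents, odd positions hold hit counts.
    for exponent, hits in zip(tokens[0::2], tokens[1::2]):
        if exponent <= max_exponent:
            total += hits
    return total


# The example pair list quoted in this README.
example = "-49 1 -8 1 -4 2 -3 3 -1 5 0 10"
significant = hits_at_or_below(example, -3)  # hits with evalue <= 1e-3
```

Because the bins are only one decade wide, the threshold can be resolved
only to a power of ten.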
######################################################
August 30 - seeds.gz
Description: a complete set of representative proteins of the
ProtoMap classification at level 1e-0. The set contains
one representative for each cluster of size >= 2 and all
singletons that are not satellite members of other clusters
(a total of 42756 proteins).
The file is in fasta format.
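
Since the file is gzip-compressed FASTA, it can be read without unpacking
it first. The following is a minimal Python sketch using only the standard
library. The function names read_fasta and read_gzipped_fasta are
hypothetical, and the sketch assumes conventional FASTA records (a ">"
header line followed by one or more sequence lines).

```python
import gzip


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)


# Usage, assuming seeds.gz is in the current directory:
#     count = sum(1 for _ in read_gzipped_fasta("seeds.gz"))
```

Counting the records this way gives a quick check against the total number
of proteins stated above.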
######################################################
Other files -
class_size
new_class.final
new_class
res.prob
total_num_homologs.nr