This directory contains files of the ProtoMap classification
(release 3.0)


######################################################
List of files
######################################################

May 26 2000 - 1e-0.class.gz 
	Description: Classification at the lowest level of the 
	hierarchy (level 1e-0). Proteins are sorted by length. 
	Each line is of the form:
	    

	 is a list that may contain several 
	families or may be empty 

	Cluster numbers are positive, except for -1, -2 and -3, which 
	stand for "Envelope polyproteins", "Cytochrome b" and 
	"Ribulose large chain" respectively.
	Singletons were not classified in the first round, but most 
	of them (excluding very short sequences) are classified 
	as satellite members of other clusters. The list of these 
	satellite assignments is available upon request. 

######################################################
July 31 2000 - num_homologs.swiss38.gz
	Description: Number of similar sequences found in 
	Gapped-Blast search (with the BLOSUM62 scoring matrix and 
	gap penalties of 11 for opening and 1 for each extension). 
	Only SwissProt38 entries are listed in this file, and the
	statistics are based only on SwissProt38 entries. 
	The statistics for all sequences in the NR database are 
	given in the file num_homologs.nr.gz

	Proteins are sorted by length. Each line is of the form:
	   aa <#low-complexity amino acids> lc 

	<#low-complexity amino acids> is the number of amino acids 
	that are marked as part of low-complexity segments according 
	to the program SEG (if most of the sequence is low-complexity,
	expect an excess of hits).

	The hit counts are given as a list of pairs: the base-10 
	exponent of the evalue upper bound, followed by <#hits>.
	For example: -49 1 -8 1 -4 2 -3 3 -1 5 0 10
	I.e. there is one hit with  1e-50 < evalue <= 1e-49
	     there is one hit with  1e-9 < evalue <= 1e-8
	     there are two hits with  1e-5 < evalue <= 1e-4
	     there are three hits with  1e-4 < evalue <= 1e-3	
	     there are five hits with  1e-2 < evalue <= 1e-1
	     there are ten hits with  1e-1 < evalue <= 1e-0
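
	The example above can be decoded mechanically. A minimal 
	Python sketch, assuming the pair list has already been cut 
	out of the line; the function names are illustrative and 
	not part of ProtoMap:

```python
def parse_hit_pairs(text):
    """Parse '<exponent> <#hits>' pairs into a dict {exponent: hits}."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens")
    return {int(tokens[i]): int(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def describe_bins(bins):
    """Return one line per bin; exponent k means 1e(k-1) < evalue <= 1e(k)."""
    return [
        f"{hits} hit(s) with 1e{k - 1} < evalue <= 1e{k}"
        for k, hits in sorted(bins.items())
    ]


# The example list from this README.
example = "-49 1 -8 1 -4 2 -3 3 -1 5 0 10"
bins = parse_hit_pairs(example)
for line in describe_bins(bins):
    print(line)
```

	Keeping the bins in a dict keyed by exponent makes it easy 
	to look up a single bin or to sum over a range of bins.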

######################################################
July 31 2000 - num_homologs.nr.gz
	Description: Number of similar sequences found in 
	Gapped-Blast search (with the BLOSUM62 scoring matrix and 
	gap penalties of 11 for opening and 1 for each extension). 
	These are the statistics for all sequences in the NR database. 

	Proteins are sorted by length. Each line is of the form:
	   aa <#low-complexity amino acids> lc 

	<#low-complexity amino acids> is the number of amino acids 
	that are marked as part of low-complexity segments according 
	to the program SEG (if most of the sequence is low-complexity,
	expect an excess of hits).

	The hit counts are given as a list of pairs: the base-10 
	exponent of the evalue upper bound, followed by <#hits>.
	For example: -49 1 -8 1 -4 2 -3 3 -1 5 0 10
	I.e. there is one hit with  1e-50 < evalue <= 1e-49
	     there is one hit with  1e-9 < evalue <= 1e-8
	     there are two hits with  1e-5 < evalue <= 1e-4
	     there are three hits with  1e-4 < evalue <= 1e-3	
	     there are five hits with  1e-2 < evalue <= 1e-1
	     there are ten hits with  1e-1 < evalue <= 1e-0
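
	Because each bin is bounded above by 1e<exponent>, the 
	number of hits at or below an evalue cutoff is the sum of 
	the bins whose exponent does not exceed the cutoff exponent. 
	A minimal Python sketch, assuming the pair list has been 
	isolated from the line; hits_at_or_below is a hypothetical 
	helper name:

```python
def hits_at_or_below(pairs_text, max_exponent):
    """Sum the hits in all bins with evalue <= 1e<max_exponent>."""
    tokens = [int(t) for t in pairs_text.split()]
    # Even positions hold exponents, odd positions hold hit counts.
    pairs = zip(tokens[0::2], tokens[1::2])
    return sum(hits for exponent, hits in pairs if exponent <= max_exponent)


# The example list from this README.
example = "-49 1 -8 1 -4 2 -3 3 -1 5 0 10"
print(hits_at_or_below(example, -3))  # 1 + 1 + 2 + 3 = 7
```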

######################################################
August 30 - seeds.gz
	Description: a complete set of representative proteins of 
	the ProtoMap classification at level 1e-0. The set contains  
	one representative for each cluster of size >= 2 and all
	singletons that are not satellite members of other clusters
	(a total of 42756 proteins). 
	The file is in fasta format.
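
	Since the file is gzip-compressed fasta, it can be read 
	without unpacking it first. A minimal Python sketch using 
	only the standard library; read_fasta and 
	read_gzipped_fasta are illustrative names, and the header 
	layout of the entries is not assumed:

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path):
    """Yield (header, sequence) tuples from a gzip-compressed fasta file."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)


# Small in-memory demonstration with made-up records.
demo = io.StringIO(">a\nMK\nLV\n>b\nGG\n")
records = list(read_fasta(demo))
print(records)
```

	Calling sum(1 for _ in read_gzipped_fasta("seeds.gz")) on 
	the real file gives a record count that can be compared 
	with the total stated above.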

######################################################
Other files -

class_size
new_class.final
new_class
res.prob
total_num_homologs.nr