seqjoin predicts the complete cDNA insert sequence of partially sequenced cDNA clones. The clones' partial experimental sequences are matched against a database of complete cDNA sequences. If a match is found, the clone's insert sequence is predicted from the vector sequence, the matched database cDNA entry and the partial experimental clone sequence. seqjoin works on the output of the sequence analysis programs phred, phrap and cross_match, and also uses the Emboss package.
The prediction of the complete cDNA inserts by seqjoin relies on a set of rules and assumptions. The experimental clone sequences (= tag sequences) are assumed to be derived from the 5'-end and to contain a small stretch of vector sequence. The tag sequences are aligned to the vector sequence and to a full-length cDNA sequence database using the program cross_match. The part of the tag sequence that aligns to the vector sequence is removed and replaced by the vector sequence itself, so any sequencing errors introduced in this range are eliminated.
The remainder of the tag sequence is replaced by the full-length database sequence, either complete or 5'-truncated, provided that the alignment between the experimental and the database sequence suggests that they originate from the same transcript. If the alignments of the tag sequence to the vector and to the database entry are not adjacent, the gap is closed with the experimental sequence, provided that the sequence quality in this range is sufficient.
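The joining rule can be illustrated with a minimal Python sketch. It is not seqjoin's actual code: the `Hit` record, the function name `predict_insert`, the 0-based half-open coordinates and the quality cutoff of 20 are all assumptions made for illustration, and cross_match output would have to be parsed into this form first.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hit:
    """One alignment of the tag to a subject (vector or database entry).

    Hypothetical record; coordinates are 0-based and half-open.
    """
    tag_start: int
    tag_end: int
    subj_start: int
    subj_end: int


def predict_insert(tag: str, quals: Sequence[int], vector: str, db_seq: str,
                   vec_hit: Hit, db_hit: Hit,
                   min_gap_quality: int = 20) -> Optional[str]:
    """Join vector, gap and database sequence; return None if not possible."""
    if db_hit.tag_start < vec_hit.tag_end:
        return None  # overlapping alignments are not handled in this sketch

    # Bases between the two alignments come from the experimental sequence,
    # but only if every base in the gap meets the (assumed) quality cutoff.
    gap = tag[vec_hit.tag_end:db_hit.tag_start]
    gap_quals = quals[vec_hit.tag_end:db_hit.tag_start]
    if any(q < min_gap_quality for q in gap_quals):
        return None

    # The tag's vector-aligned part is replaced by the vector sequence itself.
    vector_part = vector[vec_hit.subj_start:vec_hit.subj_end]
    # The rest is taken from the database entry, starting where the alignment
    # begins (a start > 0 corresponds to a 5'-truncated clone).
    db_part = db_seq[db_hit.subj_start:]
    return vector_part + gap + db_part


# The tag carries a sequencing error in its vector part (ACGA instead of ACGT).
tag = "ACGAACGT" + "GG" + "TTTTCC"
joined = predict_insert(tag, [30] * len(tag), "ACGTACGT", "AATTTTCCGGGGAAAA",
                        Hit(0, 8, 0, 8), Hit(10, 16, 2, 8))
print(joined)  # ACGTACGTGGTTTTCCGGGGAAAA
```

In the example the error in the vector-aligned part disappears because that stretch is taken from the vector sequence, and the two gap bases are kept from the tag because their quality is above the cutoff.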
Differences between the experimental and the database sequences are evaluated using the quality measures provided by the phred program. A set of rules is applied to distinguish between sequencing errors, substitutions representing single nucleotide polymorphisms (SNPs), stretches of substitutions suggesting alternative splicing, and insertions or deletions that either represent polymorphisms or suggest that the aligned sequences are alternative splice forms.
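As an illustration of how such rules could look, the sketch below classifies substitutions by phred quality and by whether they occur in a stretch. The relation Q = -10 * log10(P) between a phred quality value and the base-call error probability is standard; the cutoffs (`min_quality=20`, `min_run=3`), the labels and the function names are assumptions, since the exact rules used by seqjoin are not given here.

```python
from typing import Dict, Iterable, List, Sequence


def phred_error_probability(q: float) -> float:
    """Error probability for a phred quality value: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def group_runs(positions: Iterable[int]) -> List[List[int]]:
    """Group mismatch positions into runs of consecutive positions."""
    runs: List[List[int]] = []
    for p in sorted(positions):
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    return runs


def classify_substitutions(positions: Iterable[int], quals: Sequence[int],
                           min_quality: int = 20,
                           min_run: int = 3) -> Dict[int, str]:
    """Label each substituted tag position (cutoffs are illustrative)."""
    labels: Dict[int, str] = {}
    for run in group_runs(positions):
        for p in run:
            if len(run) >= min_run:
                # A stretch of substitutions hints at a different splice form.
                labels[p] = "possible_alternative_splicing"
            elif quals[p] >= min_quality:
                # A confident base call that differs from the database entry.
                labels[p] = "candidate_snp"
            else:
                labels[p] = "sequencing_error"
    return labels


quals = [30] * 9 + [10]
print(classify_substitutions([2, 5, 6, 7, 9], quals))
```

With these assumed cutoffs, position 2 is labelled a candidate SNP, positions 5 to 7 a possible alternative splice form, and the low-quality position 9 a sequencing error.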
If cross_match identifies more than one alignment of the experimental sequence to the database sequence, alternative splice forms are assumed. While single substitutions are taken into account, single deletions or insertions are ignored. We assume that single deletions or insertions leading to frame shifts represent sequencing errors rather than real polymorphisms. Alternative splicing prevents insert prediction by the seqjoin program.
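These two decisions can be sketched in Python as follows. The `Difference` record and both function names are hypothetical, and the sketch only encodes what is stated above: more than one database alignment blocks the prediction, and single-base insertions or deletions are dropped while everything else is passed on unchanged.

```python
from typing import List, NamedTuple


class Difference(NamedTuple):
    """One difference between tag and database entry (hypothetical record)."""
    kind: str      # "substitution", "insertion" or "deletion"
    position: int  # position in the tag
    length: int    # number of bases affected


def can_predict(n_db_alignments: int) -> bool:
    """Prediction requires exactly one alignment to the database entry.

    Several alignments are taken to indicate alternative splice forms.
    """
    return n_db_alignments == 1


def filter_differences(diffs: List[Difference]) -> List[Difference]:
    """Drop single-base insertions/deletions; keep all other differences."""
    return [d for d in diffs
            if not (d.kind in ("insertion", "deletion") and d.length == 1)]


diffs = [Difference("substitution", 12, 1),
         Difference("deletion", 40, 1),
         Difference("insertion", 75, 4)]
kept = filter_differences(diffs)
print([d.kind for d in kept])  # ['substitution', 'insertion']
```

How longer insertions or deletions are weighed against each other is left open in this sketch; they are simply kept for the later rules to evaluate.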
The seqjoin program produces a number of output files. It prepares a file of Emboss commands that is used later to carry out the actual sequence manipulation and joining steps. For each alignment found by cross_match, a comment line is written to the file seqjoin.stat.all. This file contains details on the sequence joining and indicates which alignments and predicted insert sequences might require additional manual inspection. For alignments that the program could not use to predict the insert sequence, a comment indicating the reason is given.