EMBL Outstation - The European Bioinformatics Institute
European Nucleotide Archive annotated/assembled sequences

User Manual

Release 143, March 2020

European Bioinformatics Institute
Wellcome Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: +44-1223-494499
Telefax : +44-1223-494468
Electronic mail: datasubs@ebi.ac.uk
URL: http://www.ebi.ac.uk/ena

This manual and the database it accompanies may be
copied and redistributed freely, without advance
permission, provided that this statement is
reproduced with each copy.

CONTENTS
1 INTRODUCTION
2 CONVENTIONS USED IN THE DATABASE
2.1 Sequence Data
2.2 Organism Identification and Classification
2.3 Literature References
3 FORMAT OF THE DATABASE
3.1 Data Class
3.2 Taxonomic Division
3.3 Structure of an Entry
3.4 Line Structure
3.4.1 The ID Line
3.4.2 The AC Line
3.4.3 The PR Line
3.4.4 The DT Line
3.4.5 The DE Line
3.4.6 The KW Line
3.4.7 The OS Line
3.4.8 The OC Line
3.4.9 The OG Line
3.4.10 The Reference (RN, RC, RP, RX, RG, RA, RT, RL) Lines
3.4.10.1 The RN Line
3.4.10.2 The RC Line
3.4.10.3 The RP Line
3.4.10.4 The RX Line
3.4.10.5 The RG Line
3.4.10.6 The RA Line
3.4.10.7 The RT Line
3.4.10.8 The RL Line
3.4.11 The DR Line
3.4.12 The AH Line
3.4.13 The AS Line
3.4.14 The CO Line
3.4.15 The FH Line
3.4.16 The FT Line
3.4.17 The SQ Line
3.4.18 The Sequence Data Line
3.4.19 The CC Line
3.4.20 The XX Line
3.4.21 The // Line

APPENDIX A STANDARD BASE CODES
APPENDIX B MODIFIED BASE CODES
APPENDIX C REFERENCES FOR ABBREVIATIONS AND SYMBOLS

1 INTRODUCTION
This document describes the format and conventions used in ENA sequence
records. An attempt has been made to make the collected data as easily
accessible as possible without restricting their usefulness to any
particular type of computing environment. For this reason, the simplest
possible organisation ("flat file") has been chosen.
The main body of this User Manual describes the features of the database which
will remain stable, such as the flat file format and the use of line types
to distinguish different kinds of information. Features of the
database more likely to require change (such as journal abbreviations)
are described in the appendices. Information which applies specifically
to the current release of the database is presented in the Release
Notes. The Release Notes also describe changes which are foreseen in future
releases.
It is likely that the need to represent new kinds of information in the
database will eventually necessitate changes or additions to the
presentation of data.
Such changes will be made as far as possible in ways which have minimal impact
on user programs and procedures. For example, a new type of data could be
added to the database as a new line type (see Section 3) without
affecting the processing of existing line types.
We would like to stress that both this manual and the database itself are free
from any copyright restrictions (please see the statement on the title
page). While we would appreciate acknowledgement if our efforts have been
useful to you, we want to ensure that the data are freely available to anyone
interested.

2 CONVENTIONS USED IN THE DATABASE
This section describes the general conventions which have been applied to
the information in the database in order to achieve uniformity of
presentation.
Specific abbreviations and symbol usage are summarized in the appendices.

2.1 Sequence Data
Nucleotide sequence data are generally presented in the database as they
have been submitted or published, subject to certain conventions which have been
established for the database as a whole. The sequences are always listed in the
5' to 3' direction, regardless of the published order. Bases are numbered
sequentially beginning with 1 at the 5' end of the sequence.
The sequences are presented in the database in a form corresponding to the
biological state of the information in vivo. Thus, cDNA sequences are stored
in the database as RNA sequences, even though they usually appear in the
literature as DNA. For genomic data, the coding strand is stored. Data
containing coding sequences on both strands are stored according to the
prevailing conventions in the literature. The stored data generally
correspond to wild type sequences before mutation or genetic manipulation.
Sequences of tRNA molecules are stored as unmodified RNA sequences (equivalent
to the mature transcript before any base modification occurs). This form
(colinear with the genomic sequence) has been adopted to simplify both
storage and analysis of the sequences. Thus, a modified base appears in
the sequence as the corresponding unmodified base. However, each base
modification is noted in the feature table, so that the mature
tRNA sequence can be restored automatically by a simple computer program
if this is desirable. The two-letter code used by Sprinzl and Gauss has
been adopted for abbreviation of modified bases in the feature table.

2.2 Organism Identification and Classification
A unified taxonomy is used by the collaborating databases of the International
Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org).
Based on the NCBI's 'Taxon' project, this constitutes a taxonomy database which
reflects current phylogenetic knowledge. It is a sequence-based taxonomy as
far as possible, and is based upon published authorities wherever appropriate.
Deciding criteria include a variety of physiological, ecological and
morphological characters, overall morphological similarity and common descent.
Evolutionary taxonomists tend to consider both overall similarity
and common descent when making and assigning a classification while
phylogeneticists attempt to reflect the branching pattern of the underlying
phylogenetic tree. There is of course no such thing as a single best method for
classifying organisms and the choice of one system over the other has to be made
with regard to the particular purpose of the classification. Because of the
inherent ambiguity of evolutionary classification and the specific needs of
database users (e.g. trying to track down the phylogenetic history of a group
of organisms or to elucidate the evolution of a molecule), the taxonomy strives
to reflect accurately current phylogenetic knowledge.
Phylogenetic insights derived from molecular evolution studies are one of the
major sources for classification. New taxonomic information is included
as soon as it becomes available, but at the same time, efforts are made to
ensure that the arguments and evidence provided are reliable in order to avoid
frequent (and possibly unnecessary) changes to the classification system.
The OS/OC lines of all entries reflect the up-to-date taxonomic classification.
This classification is intended to be informative and helpful; no claim is
made that it is necessarily the best or most exact. This information is subject
to change in future editions.
According to the Feature Table Definition, an entry's sequence span has
to be covered by one source feature or a combination of several. 'Synthetic
constructs' are one type of sequence entry which typically contain several
source features. Here one of these source features spans the whole sequence
(/organism="synthetic construct"). The feature qualifier /focus is attached to
the preferred source feature and used to determine the taxonomic division. If
no translation table is specified, the organism with /focus will define the
translation table. Within an entry with several source features, only one
source feature carries the /focus qualifier.

2.3 Literature References
The references cited for an entry should be considered a pointer to the
literature and not a scientific credit for the elucidation of the
sequence. Although every effort is made to give complete reference
information, occasionally only a secondary source has been cited. This
has happened most frequently in cases where a secondary reference has
presented the data in a form easily entered. The speed and accuracy with
which data can be abstracted is very dependent on the form of presentation.
In such cases, we prefer to cite also the primary reference, and request
users who note such omissions to inform us so that the appropriate additions
may be made.

3 FORMAT OF THE DATABASE
The ENA assembled/annotated sequence release and update products are composed
of sequence entries. Each entry corresponds to a single contiguous molecule as
contributed to the database or reported in the literature. In some cases, entries
have been assembled from several papers reporting overlapping sequence regions.
Conversely a single paper often provides data for several entries, as when
homologous sequences from different organisms are compared.

3.1 Data Class
The data class of each entry, representing a methodological approach to the
generation of the data or a type of data, is indicated on the first (ID) line
of the entry. Each entry belongs to exactly one data class.
Class       Definition
----------- -----------------------------------------------------------
CON         Entry constructed from segment entry sequences; if unannotated,
            annotation may be drawn from segment entries
PAT         Patent
EST         Expressed Sequence Tag
GSS         Genome Survey Sequence
HTC         High Throughput cDNA sequencing
HTG         High Throughput Genome sequencing
WGS         Whole Genome Shotgun
TSA         Transcriptome Shotgun Assembly
STS         Sequence Tagged Site
STD         Standard (all entries not classified as above)

3.2 Taxonomic Division
The entries which constitute the database are grouped into taxonomic divisions,
the object being to create subsets of the database which reflect areas of
interest for many users.
In addition to the division, each entry contains a full taxonomic
classification of the organism that was the source of the stored sequence,
from kingdom down to genus and species (see below).
Each entry belongs to exactly one taxonomic division. The ID line of each entry
indicates its taxonomic division, using the three letter codes shown below:

Division              Code
--------------------  ----
Bacteriophage         PHG
Environmental Sample  ENV
Fungal                FUN
Human                 HUM
Invertebrate          INV
Other Mammal          MAM
Other Vertebrate      VRT
Mus musculus          MUS
Plant                 PLN
Prokaryote            PRO
Other Rodent          ROD
Synthetic             SYN
Transgenic            TGN
Unclassified          UNC
Viral                 VRL

3.3 Structure of an Entry
The entries in the database are structured so as to be usable by human
readers as well as by computer programs. The explanations, descriptions,
classifications and other comments are in ordinary English, and the symbols
and formatting employed for the base sequences themselves have been
chosen for readability. Wherever possible, symbols familiar to molecular
biologists have been used. At the same time, the structure is systematic
enough to allow computer programs easily to read, identify, and manipulate
the various types of data included.
Each entry in the database is composed of lines. Different types of lines,
each with its own format, are used to record the various types of data which
make up the entry. In general, fixed format items have been kept to a
minimum, and a more syntax-oriented structure adopted for the lines.
The two exceptions to this are the sequence data lines and the feature table
lines, for which a fixed format was felt to offer significant advantages
to the user. Users who write programs to process the database entries should
not make any assumptions about the column placement of items on lines other
than these two: all other line types are free-format.
A sample entry is shown in Figure 1.
Note that each line begins with a two-character line code, which indicates
the type of information contained in the line. The currently used line
types, along with their respective line codes, are listed below:
ID - identification (begins each entry; 1 per entry)
AC - accession number (>=1 per entry)
PR - project identifier (0 or 1 per entry)
DT - date (2 per entry)
DE - description (>=1 per entry)
KW - keyword (>=1 per entry)
OS - organism species (>=1 per entry)
OC - organism classification (>=1 per entry)
OG - organelle (0 or 1 per entry)
RN - reference number (>=1 per entry)
RC - reference comment (>=0 per entry)
RP - reference positions (>=1 per entry)
RX - reference cross-reference (>=0 per entry)
RG - reference group (>=0 per entry)
RA - reference author(s) (>=0 per entry)
RT - reference title (>=1 per entry)
RL - reference location (>=1 per entry)
DR - database cross-reference (>=0 per entry)
CC - comments or notes (>=0 per entry)
AH - assembly header (0 or 1 per entry)
AS - assembly information (0 or >=1 per entry)
FH - feature table header (2 per entry)
FT - feature table data (>=2 per entry)
XX - spacer line (many per entry)
SQ - sequence header (1 per entry)
CO - contig/construct line (0 or >=1 per entry)
bb - (blanks) sequence data (>=1 per entry)
// - termination line (ends each entry; 1 per entry)
Note that some entries will not contain all of the line types, and some line
types occur many times in a single entry. As indicated, each entry begins with
an identification line (ID) and ends with a terminator line (//). The various
line types appear in entries in the order in which they are listed above
(except for XX lines which may appear anywhere between the ID and SQ lines). A
detailed description of each line type is given in the following sections.

ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC X56734; S46826;
XX
DT 12-SEP-1991 (Rel. 29, Created)
DT 25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW beta-glucosidase.
XX
OS Trifolium repens (white clover)
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC fabids; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN [5]
RP 1-1859
RX DOI; 10.1007/BF00039495.
RX PUBMED; 1907511.
RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT "Nucleotide and derived amino acid sequence of the cyanogenic
RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL Plant Mol. Biol. 17(2):209-219(1991).
XX
RN [6]
RP 1-1859
RA Hughes M.A.;
RT ;
RL Submitted (19-NOV-1990) to the INSDC.
RL Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL Upon Tyne, NE2 4HH, UK
XX
DR EuropePMC; PMC99098; 11752244.
XX
FH Key Location/Qualifiers
FH
FT source 1..1859
FT /organism="Trifolium repens"
FT /mol_type="mRNA"
FT /clone_lib="lambda gt10"
FT /clone="TRE361"
FT /tissue_type="leaves"
FT /db_xref="taxon:3899"
FT mRNA 1..1859
FT /experiment="experimental evidence, no additional details
FT recorded"
FT CDS 14..1495
FT /product="beta-glucosidase"
FT /EC_number="3.2.1.21"
FT /note="non-cyanogenic"
FT /db_xref="GOA:P26204"
FT /db_xref="InterPro:IPR001360"
FT /db_xref="InterPro:IPR013781"
FT /db_xref="InterPro:IPR017853"
FT /db_xref="InterPro:IPR018120"
FT /db_xref="UniProtKB/Swiss-Prot:P26204"
FT /protein_id="CAA40058.1"
FT /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
XX
SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60
cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag 120
tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga 180
aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata 240
tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta 300
caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc 360
ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa 420
atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct 480
ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg 540
tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt 600
gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg 660
aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac 720
aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta 780
taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg 840
gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga 900
cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg 960
gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg 1020
ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc 1080
acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa 1140
acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat 1200
gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct 1260
gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga 1320
agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg 1380
ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg 1440
taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga 1500
tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa 1560
ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt 1620
tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg 1680
aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc 1740
agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac 1800
tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859
//
Figure 1 - A sample entry from the database

3.4 Line Structure
This section describes in detail the format of each type of line used in
the database. Each line begins with a two-character line type code.
This code is always followed by three blanks, so that the actual information
in each line begins in character position 6.
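
As an illustration of the layout described above, the following Python
sketch splits a line into its line code and content, and groups the lines
of each entry by line code. The names split_line and iter_entries are
hypothetical and are not part of any ENA software. The sketch strips blanks
instead of slicing at character position 6, on the assumption that some
copies of an entry may have had their blanks collapsed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def split_line(line: str) -> Tuple[str, str]:
    """Return (line_code, content) for one flat-file line."""
    line = line.rstrip("\r\n")
    # The line code occupies the first two characters; sequence data
    # lines carry two blanks there.
    return line[:2], line[2:].strip()


def iter_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping line code to its content lines."""
    entry: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines between entries, if any
        code, content = split_line(raw)
        if code == "//":
            # The terminator line closes the current entry.
            yield dict(entry)
            entry = defaultdict(list)
        elif code != "XX":
            # XX lines are spacers and carry no data.
            entry[code].append(content)
    # Lines after the last // (an unterminated entry) are discarded.
```

Because iter_entries accepts any iterable of strings, an open file object
can be passed directly, so a large release file need not be read into
memory at once.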

3.4.1 The ID Line
The ID (IDentification) line is always the first line of an entry. The
format of the ID line is:
ID <1>; SV <2>; <3>; <4>; <5>; <6>; <7> BP.
The tokens represent:
1. Primary accession number
2. Sequence version number
3. Topology: 'circular' or 'linear'
4. Molecule type (see note 1 below)
5. Data class (see section 3.1)
6. Taxonomic division (see section 3.2)
7. Sequence length (see note 2 below)

Note 1 - Molecule type: this represents the type of molecule as stored and can
be any value from the list of current values for the mandatory mol_type source
qualifier. This item should be the same as the value in the mol_type
qualifier(s) in a given entry.
Note 2 - Sequence length: The last item on the ID line is the length of the
sequence (the total number of bases in the sequence). This number includes
base positions reported as present but undetermined (coded as "N").
An example of a complete identification line is shown below:
ID CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
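
The seven tokens of the ID line can be extracted as in the following
sketch. The IdLine type and the parse_id_line function are hypothetical
names introduced for illustration; the sketch assumes that no token
contains a semicolon.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    division: str
    length: int


def parse_id_line(line: str) -> IdLine:
    """Parse 'ID <1>; SV <2>; <3>; <4>; <5>; <6>; <7> BP.' into an IdLine."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    fields = [field.strip() for field in line[2:].split(";")]
    if len(fields) != 7:
        raise ValueError("expected 7 tokens, found %d" % len(fields))
    accession, sv, topology, mol_type, data_class, division, length = fields
    sv_parts = sv.split()          # e.g. ['SV', '4']
    length_parts = length.split()  # e.g. ['500', 'BP.']
    if len(sv_parts) != 2 or sv_parts[0] != "SV":
        raise ValueError("malformed sequence version: %r" % sv)
    if len(length_parts) != 2 or length_parts[1] != "BP.":
        raise ValueError("malformed sequence length: %r" % length)
    return IdLine(accession, int(sv_parts[1]), topology, mol_type,
                  data_class, division, int(length_parts[0]))
```

Applied to the example line above, parse_id_line yields the accession
CD789012, sequence version 4 and a length of 500.
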
3.4.2 The AC Line
The AC (ACcession number) line lists the accession numbers associated with
the entry.

Examples of accession number lines are shown below:
AC X56734; S46826;
AC Y00001; X00001-X00005; X00008; Z00001-Z00005;
Each accession number, or range of accession numbers, is terminated by a
semicolon. Where necessary, more than one AC line is used. Consecutive
secondary accession numbers in ENA flatfiles are shown in the form of
inclusive accession number ranges.
Accession numbers are the primary means of identifying sequences, providing
a stable way of identifying entries from release to release. An accession
number always remains in the accession number list of the latest
version of the entry in which it first appeared. Accession numbers allow
unambiguous citation of database entries. Researchers who wish to cite entries
in their publications should always cite the first accession number in the
list (the "primary" accession number) to ensure that readers can find the
relevant data in a subsequent release. Readers wishing to find the data thus
cited must look at all the accession numbers in each entry's list.
Secondary accession numbers: One reason for allowing the existence of several
accession numbers is to allow tracking of data when entries are merged
or split. For example, when two entries are merged into one, a "primary"
accession number goes at the start of the list, and those from the
merged entries are added after this one as "secondary" numbers.

Example: AC X56734; S46826;

Similarly, if an existing entry is split into two or more entries (a rare
occurrence), the original accession number list is retained in all the derived
entries.
An accession number is dropped from the database only when the data to
which it was assigned have been completely removed from the database.
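
The AC line syntax described in this section can be read as in the
following sketch. The function names are hypothetical. Expanding a range
assumes that both ends consist of the same letter prefix followed by the
same number of digits; this is an assumption of the sketch, not a rule
stated in this manual.

```python
import re
from typing import Iterable, List, Tuple

_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def parse_ac_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (first, last) pairs; a single accession has first == last."""
    items = []
    for line in lines:
        for token in line[2:].split(";"):
            token = token.strip()
            if not token:
                continue  # text after the final semicolon is empty
            first, _, last = token.partition("-")
            items.append((first, last or first))
    return items


def expand_range(first: str, last: str) -> List[str]:
    """List every accession number in an inclusive range."""
    m_first = _ACCESSION.match(first)
    m_last = _ACCESSION.match(last)
    if (m_first is None or m_last is None
            or m_first.group(1) != m_last.group(1)
            or len(m_first.group(2)) != len(m_last.group(2))):
        raise ValueError("cannot expand range %s-%s" % (first, last))
    prefix, digits = m_first.groups()
    width = len(digits)
    return ["%s%0*d" % (prefix, width, n)
            for n in range(int(digits), int(m_last.group(2)) + 1)]
```

The first pair returned by parse_ac_lines holds the primary accession
number; all later pairs are secondary accession numbers or ranges of them.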

3.4.3 The PR Line
The PR (PRoject) line shows the International Nucleotide Sequence Database
Collaboration (INSDC) Project Identifier that has been assigned to the entry.
Full details of INSDC Project are available at
http://www.ebi.ac.uk/ena/about/page.php?page=project_guidelines.
Example: PR Project:17285;

3.4.4 The DT Line
The DT (DaTe) line shows when an entry first appeared in the database and
when it was last updated. Each entry contains two DT lines, formatted
as follows:
DT DD-MON-YYYY (Rel. #, Created)
DT DD-MON-YYYY (Rel. #, Last updated, Version #)
The DT lines from the above example are:
DT 12-SEP-1991 (Rel. 29, Created)
DT 25-NOV-2005 (Rel. 85, Last updated, Version 11)
The date supplied on each DT line indicates when the entry was created or
last updated; that will usually also be the date when the new or modified
entry became publicly visible via the EBI network servers. The release
number indicates the first quarterly release made *after* the entry was
created or last updated. The version number appears only on the "Last
updated" DT line.
The absolute value of the version number is of no particular significance; its
purpose is to allow users to determine easily if the version of an entry
which they already have is still the most up to date version. Version numbers
are incremented by one every time an entry is updated; since an entry may be
updated several times before its first appearance in a quarterly release, the
version number at the time of its first release appearance may be greater than
one. Note that because an entry may also be updated several times between
two quarterly releases, there may be gaps in the sequence of version numbers
which appear in consecutive releases.
If an entry has not been updated since it was created, it will still have
two DT lines and the "Last updated" line will have the same date (and
release number) as the "Created" line.
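
A minimal sketch of reading the two DT line formats follows. The function
name parse_dt_line and the keys of the returned dictionary are hypothetical.
Month abbreviations are mapped explicitly so that the result does not
depend on the locale of the machine.

```python
import re
from datetime import date

_MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

_DT = re.compile(
    r"^DT\s+(\d{1,2})-([A-Z]{3})-(\d{4})\s+"
    r"\(Rel\.\s*(\d+),\s*(Created|Last updated)(?:,\s*Version\s+(\d+))?\)$")


def parse_dt_line(line: str) -> dict:
    """Parse one DT line into its date, release, kind and version."""
    match = _DT.match(line.strip())
    if match is None:
        raise ValueError("malformed DT line: %r" % line)
    day, month, year, release, kind, version = match.groups()
    return {
        "date": date(int(year), _MONTHS[month], int(day)),
        "release": int(release),
        "kind": kind,  # 'Created' or 'Last updated'
        # The version number appears only on the "Last updated" line.
        "version": int(version) if version is not None else None,
    }
```

Comparing the "version" value of the "Last updated" line with that of a
locally held copy indicates whether the local copy is still current.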

3.4.5 The DE Line
The DE (Description) lines contain general descriptive information about the
sequence stored. This may include the designations of genes for which the
sequence codes, the region of the genome from which it is derived, or other
information which helps to identify the sequence. The format for a DE line is:
DE description
The description is given in ordinary English and is free-format. Often, more
than one DE line is required; when this is the case, the text is divided only
between words. The description line from the example above is
DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
The first DE line generally contains a brief description, which can stand
alone for cataloguing purposes.

3.4.6 The KW Line
The KW (KeyWord) lines provide information which can be used to generate
cross-reference indexes of the sequence entries based on functional,
structural, or other categories deemed important.
The format for a KW line is:
KW keyword[; keyword ...].
More than one keyword may be listed on each KW line; the keywords are
separated by semicolons, and the last keyword is followed by a full
stop. Keywords may consist of more than one word, and they may contain
embedded blanks and stops. A keyword is never split between lines.
An example of a keyword line is:
KW beta-glucosidase.
The keywords are ordered alphabetically; the ordering implies no hierarchy
of importance or function. If an entry has no keywords assigned to it,
it will contain a single KW line like this:
KW .
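
The keyword rules above can be applied as in the following sketch, in
which parse_kw_lines is a hypothetical name. Only the final full stop is
removed, because keywords may themselves contain stops.

```python
from typing import Iterable, List


def parse_kw_lines(lines: Iterable[str]) -> List[str]:
    """Return the keywords of an entry; an empty list if none are assigned."""
    text = " ".join(line[2:].strip() for line in lines)
    if not text.endswith("."):
        raise ValueError("keyword list is not terminated by a full stop")
    text = text[:-1]  # drop only the terminating stop
    # Keywords are never split between lines, so splitting the joined
    # text on semicolons recovers them intact.
    return [kw.strip() for kw in text.split(";") if kw.strip()]
```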

3.4.7 The OS Line
The OS (Organism Species) line specifies the preferred scientific name of
the organism which was the source of the stored sequence. In most
cases this is done by giving the Latin genus and species designations,
followed (in parentheses) by the preferred common name in English where
known. The format is:
OS Genus species (name)
In some cases, particularly for viruses and genetic elements, the only
accepted designation is a simple name such as "Canine adenovirus type 2".
In these cases only this designation is given. The species line from the
example is:
OS Trifolium repens (white clover)
Hybrid organisms are classified in their own right. A rat/mouse hybrid,
for example, would appear as follows:
OS Mus musculus x Rattus norvegicus
OC (OC for mouse)

If the source organism is unknown but has been/will be cultured, the OS
line will contain a unique name derived from what is known of the
classification. The unique name serves to identify the database entry,
which will be updated once the full classification is known. In the
case of an unknown bacterium, for example:
OS unidentified bacterium B8
OC Bacteria.
For environmental samples where there is no intention to culture the
organism and complete taxonomy cannot be determined, collective names
are used in the OS line and the classification given extends down to
the most resolved taxonomic node possible, for example:
OS uncultured proteobacterium
OC Bacteria; Proteobacteria; environmental samples.

For naturally occurring plasmids the OS/OC lines will contain the
source organism and the plasmid name will appear on the OG line.
For example:
OS Escherichia coli
OC Prokaryota; ... Enterobacteriaceae.
XX
OG Plasmid colE1
For artificial plasmids the OS line will be "OS Cloning vector" and the
sequence will be classified as an artificial sequence. For example:
OS Cloning vector M13plex17
OC Artificial sequences; vectors.

Where only a naturally occurring part of a plasmid is reported, the plasmid
name will appear on the OG line and the OS/OC lines will describe the natural
source.
For example:
OS Escherichia coli
OC Prokaryota; ... Enterobacteriaceae.
XX
OG Plasmid pUC8

3.4.8 The OC Line
The OC (Organism Classification) lines contain the taxonomic classification
of the source organism as described in Section 2.2 above.
The classification is listed top-down as nodes in a taxonomic tree in which
the most general grouping is given first. The classification may be
distributed over several OC lines, but nodes are not split or hyphenated
between lines. The individual items are separated by semicolons and the
list is terminated by a full stop. The format for the OC line is:
OC Node[; Node...].

Example classification lines:
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; Rosidae;
OC Fabales; Fabaceae; Papilionoideae; Trifolium.
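
Since nodes are never split between lines, the classification can be
recovered by joining the OC lines and splitting on semicolons, as in this
sketch. The names parse_oc_lines and belongs_to are hypothetical.

```python
from typing import Iterable, List


def parse_oc_lines(lines: Iterable[str]) -> List[str]:
    """Return the taxonomic nodes, most general grouping first."""
    text = " ".join(line[2:].strip() for line in lines)
    if not text.endswith("."):
        raise ValueError("classification is not terminated by a full stop")
    return [node.strip() for node in text[:-1].split(";") if node.strip()]


def belongs_to(lineage: List[str], node: str) -> bool:
    """Report whether the given node appears anywhere in the lineage."""
    return node in lineage
```

For the example classification lines above, parse_oc_lines returns a list
starting with "Eukaryota" and ending with "Trifolium".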

3.4.9 The OG Line
The OG (OrGanelle) linetype indicates the sub-cellular location of non-nuclear
sequences. It is only present in entries containing non-nuclear sequences
and appears after the last OC line in such entries.
The OG line contains
a) one data item (title cased) from the controlled list detailed under the
/organelle qualifier definition in the Feature Table Definition document
that accompanies this release or
b) a plasmid name.
Examples include "Mitochondrion", "Plastid:Chloroplast" and "Plasmid pBR322".

For example, a chloroplast sequence from Euglena gracilis would appear as:
OS Euglena gracilis (green algae)
OC Eukaryota; Planta; Phycophyta; Euglenophyceae.
OG Plastid:Chloroplast

3.4.10 The Reference (RN, RC, RP, RX, RG, RA, RT, RL) Lines
These lines comprise the literature citations within the database.
The citations provide access to the papers from which the data have been
abstracted. The reference lines for a given citation occur in a block, and
are always in the order RN, RC, RP, RX, RG, RA, RT, RL. Within each such
reference block the RN line occurs once, the RC, RP and RX lines occur zero
or more times, and the following lines must occur at least once: the RA (or RG), RT, RL.
If several references are given, there will be a reference block for each.
Examples of references:

RN [5]
RP 1-1859
RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT "Nucleotide and derived amino acid sequence of the cyanogenic
RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL Plant Mol. Biol. 17:209-219(1991).

The formats of the individual lines are explained in the following
paragraphs.
RN [2]
RP 1-1657990
RG Prochlorococcus genome consortium
RA Larimer F.;
RT ;
RL Submitted (03-JUL-2003) to the INSDC.
RL Larimer F., DOE Joint Genome Institute, Production Genomics Facility,
RL 2800 Mitchell Drive, Walnut Creek, CA 94598, USA, and the Genome
RL Analysis Group, Oak Ridge National Laboratory, 1060 Commerce Park Drive,
RL Oak Ridge, TN 37831, USA;
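
The block structure described at the start of this section can be
recovered as in the following sketch, which starts a new block at every RN
line and then checks the occurrence rules stated above. The names
split_reference_blocks and check_block are hypothetical.

```python
from typing import Dict, Iterable, List

_REFERENCE_CODES = ("RN", "RC", "RP", "RX", "RG", "RA", "RT", "RL")


def split_reference_blocks(lines: Iterable[str]) -> List[Dict[str, List[str]]]:
    """Group reference lines into blocks, one block per RN line."""
    blocks: List[Dict[str, List[str]]] = []
    for line in lines:
        code = line[:2]
        if code not in _REFERENCE_CODES:
            continue  # e.g. XX spacer lines between blocks
        if code == "RN":
            blocks.append({})
        elif not blocks:
            raise ValueError("reference line found before any RN line")
        blocks[-1].setdefault(code, []).append(line[2:].strip())
    return blocks


def check_block(block: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the block is valid."""
    problems = []
    if len(block.get("RN", [])) != 1:
        problems.append("RN must occur exactly once")
    if not (block.get("RA") or block.get("RG")):
        problems.append("RA or RG must occur at least once")
    for code in ("RT", "RL"):
        if not block.get(code):
            problems.append("%s must occur at least once" % code)
    return problems
```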

3.4.10.1 The RN Line
The RN (Reference Number) line gives a unique number to each reference
citation within an entry. This number is used to designate the reference
in comments and in the feature table. The format of the RN line is:
RN [n]
The reference number is always enclosed in square brackets. Note that the
set of reference numbers which appear in an entry does not necessarily form a
continuous sequence from 1 to n, where the entry contains "n" references. As
references are added to and removed from an entry, gaps may be introduced into
the sequence of numbers. The important point is that once an RN number has
been assigned to a reference within an entry it never changes. The reference
number line in the example above is:
RN [5]

3.4.10.2 The RC Line
The RC (Reference Comment) linetype is an optional linetype which appears if
the reference has a comment. The comment is in English and as many RC lines as
are required to display the comment will appear. They are formatted thus:
RC comment

3.4.10.3 The RP Line
The RP (Reference Position) linetype is an optional linetype which appears if
one or more contiguous base spans of the presented sequence can be attributed
to the reference in question. As many RP lines as are required to display the
base span(s) will appear.
The base span(s) indicate which part(s) of the sequence are covered by the
reference. Note that the numbering scheme is for the sequence as presented
in the database entry (i.e. from 5' to 3' starting at 1), not the scheme used
by the authors in the reference should the two differ. The RP line is
formatted thus:
RP i-j[, k-l...]
The RP line in the example above is:
RP 1-1859
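
The base spans on RP lines can be read as in this sketch, where
parse_rp_lines is a hypothetical name. The spans are returned as 1-based,
inclusive (start, end) pairs in the numbering of the database entry.

```python
from typing import Iterable, List, Tuple


def parse_rp_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return the base spans 'i-j[, k-l...]' as (start, end) pairs."""
    spans = []
    for line in lines:
        for part in line[2:].split(","):
            part = part.strip()
            if not part:
                continue
            start, end = part.split("-")
            spans.append((int(start), int(end)))
    return spans
```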

3.4.10.4 The RX Line
The RX (reference cross-reference) linetype is an optional linetype which
contains a cross-reference to an external citation or abstract resource.
For example, if a journal citation exists in the PUBMED database, there will
be an RX line pointing to the relevant PUBMED identifier.
The format of the RX line is as follows:
RX resource_identifier; identifier.
The first item on the RX line, the resource identifier, is the abbreviated
name of the data collection to which reference is made. The current
set of cross-referenced resources is:
Resource ID Fullname
----------- ------------------------------------
PUBMED PUBMED bibliographic database (NLM)
DOI Digital Object Identifier (International DOI Foundation)
AGRICOLA US National Agriculture Library (NAL) of the US Department
of Agriculture (USDA)
The second item on the RX line, the identifier, is a pointer to the entry in
the external resource to which reference is being made. The data item used as
the primary identifier depends on the resource being referenced.
For example:
RX DOI; 10.1016/0024-3205(83)90010-3.
RX PUBMED; 264242.
Note that further details of DOI are available at http://www.doi.org/. URLs
formulated in the following way are resolved to the correct full text URLs:
http://dx.doi.org/<doi>
e.g. http://dx.doi.org/10.1016/0024-3205(83)90010-3
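
As an illustration of the RX format, the sketch below splits an RX line into
its two items and builds the resolver URL for a DOI. The function names
parse_rx and doi_url are hypothetical, and the code assumes one cross-reference
per line, terminated by a full stop.

```python
def parse_rx(line):
    """Return (resource_identifier, identifier) from a single RX line."""
    code, _, body = line.partition(" ")
    if code != "RX":
        raise ValueError("not an RX line: %r" % line)
    resource, _, identifier = body.strip().partition(";")
    identifier = identifier.strip()
    if identifier.endswith("."):
        identifier = identifier[:-1]  # drop the terminating full stop
    return resource.strip(), identifier


def doi_url(identifier):
    """Build a resolver URL following the http://dx.doi.org/<doi> pattern."""
    return "http://dx.doi.org/" + identifier


resource, identifier = parse_rx("RX DOI; 10.1016/0024-3205(83)90010-3.")
if resource == "DOI":
    print(doi_url(identifier))
```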

3.4.10.5 The RG Line
The RG (Reference Group) lines list the working groups/consortia that
produced the record. The RG line is mainly used in submission reference
blocks, but can also be used in paper references if the working group is
cited as an author in the paper.

3.4.10.6 The RA Line
The RA (Reference Author) lines list the authors of the paper (or other
work) cited. All of the authors are included, and are listed in the order
given in the paper. The names are listed surname first followed by a blank
followed by initial(s) with stops. Occasionally the initials may not
be known, in which case the surname alone will be listed. The author names
are separated by commas and terminated by a semicolon; they are not split
between lines. The RA line in the example is:
RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
As many RA lines as necessary are included for each reference.
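
The author list can be recovered from the RA lines of one reference as in
the sketch below. The function name parse_ra is hypothetical. Treating the
last blank-separated token as the initials only when it ends in a stop is a
heuristic chosen here, not a rule stated by this manual.

```python
def parse_ra(lines):
    """Return a list of (surname, initials) pairs from the RA lines of one reference."""
    body = " ".join(
        line.partition(" ")[2].strip() for line in lines if line.startswith("RA ")
    )
    authors = []
    # Names are separated by commas; the list is terminated by a semicolon.
    for name in body.rstrip(";").split(","):
        name = name.strip()
        if not name:
            continue
        head, _, last = name.rpartition(" ")
        if head and last.endswith("."):
            authors.append((head.strip(), last))
        else:
            authors.append((name, ""))  # surname only, initials unknown
    return authors


print(parse_ra(["RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;"]))
```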

3.4.10.7 The RT Line
The RT (Reference Title) lines give the title of the paper (or other work) as
exactly as is possible given the limitations of computer character sets. Note
that the form used is that which would be used in a citation rather than that
displayed at the top of the published paper. For instance, where journals
capitalise major title words this is not preserved. The title is enclosed in
double quotes, and may be continued over several lines as necessary. The title
lines are terminated by a semicolon. The title lines from the example are:
RT "Nucleotide and derived amino acid sequence of the cyanogenic
RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
Greek letters in titles are spelled out; for example, a title in an entry
would contain "kappa-immunoglobulin" even though the letter itself may be
present in the original title. Similar simplifications have been made in
other cases (e.g. subscripts and superscripts). Note that the RT line of
a citation which has no title (such as a submission to the database) contains
only a semicolon.
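
A title spread over several RT lines can be reassembled as sketched below.
The function name parse_rt is hypothetical. Joining continuation lines with a
single blank is an assumption made for this example.

```python
def parse_rt(lines):
    """Join the RT lines of one reference and return the bare title ('' if none)."""
    parts = [line.partition(" ")[2].strip() for line in lines if line.startswith("RT")]
    text = " ".join(part for part in parts if part)
    if text.endswith(";"):
        text = text[:-1].strip()  # drop the terminating semicolon
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]  # drop the enclosing double quotes
    return text


title = parse_rt([
    'RT "Nucleotide and derived amino acid sequence of the cyanogenic',
    'RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";',
])
print(title)
```

A reference without a title, whose RT line holds only a semicolon, yields an
empty string.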

3.4.10.8 The RL Line
The RL (Reference Location) line contains the conventional citation
information for the reference. In general, the RL lines alone are
sufficient to find the paper in question. They include the journal,
volume number, page range and year for each paper.
Journal names are abbreviated according to existing ISO standards
(International Standard Serial Number).
The format for the location lines is:
RL journal vol:pp-pp(year).
Thus, the reference location line in the example is:
RL Plant Mol. Biol. 17:209-219(1991).
Very occasionally a journal is encountered which does not consecutively
number pages within a volume, but rather starts the numbering anew for
each issue number. In this case the issue number must be included, and the
format becomes:
RL journal vol(no):pp-pp(year).

If a paper is in press, the RL line will appear with such information as
we have available, the missing items appearing as zeros. For example:
RL Nucleic Acids Res. 0:0-0(2004).
This indicates a paper which will be published in Nucleic Acids Research at some
point in 2004, for which we have no volume or page information. Such references
are updated to include the missing information when it becomes available.
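
The journal forms of the RL line described so far can be matched with one
regular expression, as in the following sketch. The pattern and the name
parse_rl_journal are illustrative assumptions rather than an official grammar.
The optional "(misc)" prefix is accepted, and a paper is treated as in press
when volume and pages are all zero.

```python
import re

RL_JOURNAL = re.compile(
    r"^RL\s+(?P<misc>\(misc\)\s+)?(?P<journal>.+?)\s+"
    r"(?P<volume>[^\s:()]+)(?:\((?P<issue>[^)]+)\))?:"
    r"(?P<first>[^\s-]+)-(?P<last>[^\s(]+)\((?P<year>\d{4})\)\.$"
)


def parse_rl_journal(line):
    """Return the citation fields as a dict, or None for non-journal RL lines."""
    match = RL_JOURNAL.match(line.strip())
    if match is None:
        return None
    ref = match.groupdict()
    ref["misc"] = ref["misc"] is not None
    # Missing items of an in-press paper appear as zeros.
    ref["in_press"] = ref["volume"] == "0" and ref["first"] == "0" and ref["last"] == "0"
    return ref


print(parse_rl_journal("RL Plant Mol. Biol. 17:209-219(1991)."))
print(parse_rl_journal("RL Nucleic Acids Res. 0:0-0(2004).")["in_press"])  # True
```

RL lines in any other form, such as book, submission, thesis, patent or
unpublished references, return None and need separate handling.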
Another variation of the RL line is used for papers found in books
or other similar publications, which are cited as shown below:
RA Birnstiel M., Portmann R., Busslinger M., Schaffner W.,
RA Probst E., Kressmeann A.;
RT "Functional organization of the histone genes in the
RT sea urchin Psammechinus: A progress report";
RL (in) Engberg J., Klenow H., Leick V. (Eds.);
RL SPECIFIC EUKARYOTIC GENES:117-132;
RL Munksgaard, Copenhagen (1979).
Note specifically that the line where one would normally encounter the
journal location is replaced with lines giving the bibliographic citation
of the book. The first RL line in this case contains the designation "(in)",
which indicates that this is a book reference.
The following examples illustrate RL line formats that are used for data
submissions:
RL Submitted (19-NOV-1990) to the INSDC.
RL M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW
RL CASTLE UPON TYNE, NE2 4HH, UK
Submitter address is always included in new entries, but some older
submissions do not have this information.
RL lines take another form for thesis references.
For example:
RL Thesis (1999), Department of Genetics,
RL University of Cambridge, Cambridge, U.K.
For an unpublished reference, the RL line takes the following form:
RL Unpublished.
Patent references have the following form:
RL Patent number EP0238993-A/3, 30-SEP-1987.
RL BAYER AG.
The words "Patent number" are followed by the patent application number, the
patent type (separated by a hyphen), the sequence's serial number within the
patent (separated by a slash) and the patent application date. The subsequent RL
lines list the patent applicants, normally company names.
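
The patent form can be taken apart as sketched below. The function name
parse_rl_patent and the returned field names are hypothetical, and the sketch
assumes that each applicant line ends with a full stop.

```python
import re

RL_PATENT = re.compile(
    r"^RL\s+Patent number\s+(?P<number>[^-\s]+)-(?P<type>[^/\s]+)/(?P<serial>\d+),"
    r"\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})\.$"
)


def parse_rl_patent(lines):
    """Parse the RL lines of a patent reference; return None if the first line differs."""
    match = RL_PATENT.match(lines[0].strip())
    if match is None:
        return None
    ref = match.groupdict()
    ref["serial"] = int(ref["serial"])
    applicants = []
    for line in lines[1:]:
        text = line.partition(" ")[2].strip()
        if text.endswith("."):
            text = text[:-1]  # drop the terminating full stop
        applicants.append(text)
    ref["applicants"] = applicants
    return ref


print(parse_rl_patent(["RL Patent number EP0238993-A/3, 30-SEP-1987.", "RL BAYER AG."]))
```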
Finally, for journal publications where no ISSN is available for the
journal (proceedings and abstracts, for example), the RL line contains the
designation "(misc)" as in the following example.
RL (misc) Proc. Vth Int. Symp. Biol. Terr. Isopods 2:365-380(2003).

3.4.11 The DR Line
The DR (Database Cross-reference) line cross-references other databases which
contain information related to the entry in which the DR line appears. For
example, if an annotated/assembled sequence in ENA is cited in the IMGT/LIGM
database there will be a DR line pointing to the relevant IMGT/LIGM entry.
The format of the DR line is as follows:
DR database_identifier; primary_identifier; secondary_identifier.
The first item on the DR line, the database identifier, is the abbreviated
name of the data collection to which reference is made.
The second item on the DR line, the primary identifier, is a pointer to
the entry in the external database to which reference is being made.
The third item on the DR line is the secondary identifier, if available, from
the referenced database.
An example of a DR line is shown below:
DR MGI; 98599; Tcrb-V4.
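
The three items of a DR line can be separated as in this sketch. The function
name parse_dr is hypothetical. It assumes that items are separated by
semicolons and that the line ends with a full stop.

```python
def parse_dr(line):
    """Return (database, primary_identifier, secondary_identifier or None)."""
    code, _, body = line.partition(" ")
    if code != "DR":
        raise ValueError("not a DR line: %r" % line)
    body = body.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError("DR line needs at least two items: %r" % line)
    secondary = fields[2] if len(fields) > 2 and fields[2] else None
    return fields[0], fields[1], secondary


print(parse_dr("DR MGI; 98599; Tcrb-V4."))  # ('MGI', '98599', 'Tcrb-V4')
```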

3.4.12 The AH Line (in TPA and TSA records only)
Third Party Annotation (TPA) and Transcriptome Shotgun Assembly (TSA) records
may include information on the composition of their sequences to show
which spans originated from which contributing primary sequences. The AH
(Assembly Header) line provides column headings for the assembly information.
The lines contain no data and may be ignored by computer programs.
The AH line format is:
AH LOCAL_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP

3.4.13 The AS Line (in TPA and TSA records)
The AS (ASsembly Information) lines provide information on the composition of
a TPA or TSA sequence. These lines include information on local sequence spans
(those spans seen in the sequence of the entry showing the AS lines) plus
identifiers and base spans of contributing primary sequences (for ENA
primary entries only).

a) LOCAL_SPAN          base span on local sequence shown in entry
b) PRIMARY_IDENTIFIER  acc.version of contributing ENA sequence(s)
                       or trace identifier for ENA read(s)
c) PRIMARY_SPAN        base span on contributing ENA primary
                       sequence or not_available for ENA read(s)
d) COMP                'c' is used to indicate that contributing sequence
                       originates from complementary strand in primary
                       entry

Example:
AH LOCAL_SPAN  PRIMARY_IDENTIFIER  PRIMARY_SPAN   COMP
AS 1-426       AC004528.1          18665-19090
AS 427-526     AC001234.2          1-100          c
AS 527-1000    TI55475028          not_available
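
The AS columns can be read by splitting each line on blanks, as in the
sketch below. The function name parse_as and the returned keys are
hypothetical. The sketch assumes that identifiers and spans contain no blanks.

```python
def _span(text):
    """Turn 'i-j' into an (i, j) tuple of integers."""
    start, _, end = text.partition("-")
    return int(start), int(end)


def parse_as(line):
    """Parse one AS line into a dict describing a contributing primary sequence."""
    fields = line.split()
    if len(fields) not in (4, 5) or fields[0] != "AS":
        raise ValueError("not an AS line: %r" % line)
    return {
        "local_span": _span(fields[1]),
        "primary_identifier": fields[2],
        # Reads carry the literal not_available instead of a base span.
        "primary_span": None if fields[3] == "not_available" else _span(fields[3]),
        "complement": len(fields) == 5 and fields[4] == "c",
    }


for row in ["AS 1-426 AC004528.1 18665-19090",
            "AS 427-526 AC001234.2 1-100 c",
            "AS 527-1000 TI55475028 not_available"]:
    print(parse_as(row))
```

AH lines carry no data and are simply skipped before calling the function.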

3.4.14 The CO Line (in CON records only)
Con(structed) sequences in the CON data classes represent complete
chromosomes, genomes and other long sequences constructed from segment entries.
CON data class entries do not contain sequence data per se, but rather the
assembly information on all accession.versions and sequence locations relevant
to building the constructed sequence. The assembly information is represented in
the CO lines.
Example:
CO join(Z99104.1:1..213080,Z99105.1:18431..221160,Z99106.1:13061..209100,
CO Z99107.1:11151..213190,Z99108.1:11071..208430,Z99109.1:11751..210440,
CO Z99110.1:15551..216750,Z99111.1:16351..208230,Z99112.1:4601..208780,
CO Z99113.1:26001..233780,Z99114.1:14811..207730,Z99115.1:12361..213680,
CO Z99116.1:13961..218470,Z99117.1:14281..213420,Z99118.1:17741..218410,
CO Z99119.1:15771..215640,Z99120.1:16411..217420,Z99121.1:14871..209510,
CO Z99122.1:11971..212610,Z99123.1:11301..212150,Z99124.1:11271..215534)

Gaps of undefined length are represented using the expression 'gap(unk100)'.
These gaps contribute to the sequence length for the entry (as shown in the
ID line).
Example: CO join(AL358912.1:1..39187,gap(unk100),AL137130.1:1..40815,...
Gaps of defined length are represented via 'gap(#)' where # is the
gap length. These gaps also contribute to the sequence length for the entry (as
shown in the ID line).
Example: CO AE005330.1:61..14164,AE005331.1:61..3773,gap(4001),...
Below are the relevant sections of a Bacillus subtilis CON entry providing
construct information for the assembly of the Bacillus subtilis genome.

ID AL009126; SV 2; circular; genomic DNA; CON; PRO; 4214630 BP.
XX
AC AL009126;
XX
DT 18-JUL-2002 (Rel. 72, Created)
DT 07-JUL-2003 (Rel. 76, Last updated, Version 3)
XX
DE Bacillus subtilis complete genome.
XX
KW complete genome.
XX
OS Bacillus subtilis subsp. subtilis str. 168
OC Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
...
CITATION INFORMATION
...
FH Key Location/Qualifiers
FH
FT source 1..4214630
FT /organism="Bacillus subtilis subsp. subtilis str. 168"
FT /strain="168"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:224308"
XX
CO join(Z99104.2:1..213080,Z99105.2:51..202768,Z99106.2:31..195912,
CO Z99107.2:51..202089,Z99108.2:51..197409,Z99109.2:41..198743,
CO Z99110.2:41..201241,Z99111.2:41..191980,Z99112.2:41..204263,
CO Z99113.2:41..207829,Z99114.2:41..192961,Z99115.2:51..201375,
CO Z99116.2:31..204537,Z99117.2:31..199173,Z99118.2:31..200707,
CO Z99119.2:51..199922,Z99120.2:51..201059,Z99121.2:51..194692,
CO Z99122.2:51..200690,Z99123.2:31..201139,Z99124.2:51..203901)
//
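
As a worked check of the CO lines above, the sketch below sums the spans of a
join expression and compares the result with the length given in the ID line.
The function names are hypothetical. Counting 'gap(unk100)' as 100 bases is
an assumption based on the statement that such gaps contribute to the
sequence length. Only the segment and gap forms shown in this section are
handled.

```python
import re


def co_expression(lines):
    """Concatenate the data part of all CO lines into one expression."""
    return "".join(l.partition(" ")[2].strip() for l in lines if l.startswith("CO"))


def co_length(expression):
    """Total length of a CO expression: segment spans plus gaps."""
    inner = expression
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    total = 0
    for item in inner.split(","):
        item = item.strip()
        gap = re.fullmatch(r"gap\((?:unk)?(\d+)\)", item)
        segment = re.fullmatch(r"[^:]+:(\d+)\.\.(\d+)", item)
        if gap:
            total += int(gap.group(1))
        elif segment:
            total += int(segment.group(2)) - int(segment.group(1)) + 1
        else:
            raise ValueError("unsupported CO item: %r" % item)
    return total


co_lines = [
    "CO join(Z99104.2:1..213080,Z99105.2:51..202768,Z99106.2:31..195912,",
    "CO Z99107.2:51..202089,Z99108.2:51..197409,Z99109.2:41..198743,",
    "CO Z99110.2:41..201241,Z99111.2:41..191980,Z99112.2:41..204263,",
    "CO Z99113.2:41..207829,Z99114.2:41..192961,Z99115.2:51..201375,",
    "CO Z99116.2:31..204537,Z99117.2:31..199173,Z99118.2:31..200707,",
    "CO Z99119.2:51..199922,Z99120.2:51..201059,Z99121.2:51..194692,",
    "CO Z99122.2:51..200690,Z99123.2:31..201139,Z99124.2:51..203901)",
]
print(co_length(co_expression(co_lines)))  # 4214630, the BP count in the ID line
```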

3.4.15 The FH Line
The FH (Feature Header) lines are present only to improve readability of
an entry when it is printed or displayed on a terminal screen. The lines
contain no data and may be ignored by computer programs. The format of these
lines is always the same:
FH Key Location/Qualifiers
FH
The first line provides column headings for the feature table, and the second
line serves as a spacer. If an entry contains no feature table
(i.e. no FT lines - see below), the FH lines will not appear.

3.4.16 The FT Line
The FT (Feature Table) lines provide a mechanism for the annotation of the
sequence data. Regions or sites in the sequence which are of interest are
listed in the table. In general, the features in the feature table represent
signals or other characteristics reported in the cited references. In some
cases, ambiguities or features noted in the course of data preparation have
been included. The feature table is subject to expansion or change as more
becomes known about a given sequence.
Feature Table Definition Document:
A complete and definitive description of the feature table is given
in the document "The DDBJ/ENA/GenBank Feature Table: Definition".
URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.txt
Much effort is expended in the design of the feature table to try to
ensure that it will be self-explanatory to the human reader, and we therefore
expect that the official definition document will be of interest mainly
to software developers rather than to end-users of the database.
A browser derived from the document is provided to assist users in navigating
and composing feature table representations at
http://www.ebi.ac.uk/ena/WebFeat/.

3.4.17 The SQ Line
The SQ (SeQuence header) line marks the beginning of the sequence data and
gives a summary of its content. An example is:
SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
As shown, the line contains the length of the sequence in base pairs followed
by its base composition. Bases other than A, C, G and T are grouped
together as "other". (Note that "BP" is also used for single stranded RNA
sequences, which is not strictly accurate, but has been used for consistency
of format.) This information can be used as a check on accuracy or for
statistical purposes. The word "Sequence" is present solely as a marker for
readability.
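
The accuracy check mentioned above can be sketched as follows: the base
counts on the SQ line should add up to the stated length. The function name
parse_sq is hypothetical.

```python
import re


def parse_sq(line):
    """Return (length, counts) from an SQ line; counts maps 'A', 'C', ... to integers."""
    match = re.match(r"^SQ\s+Sequence\s+(\d+)\s+BP;(.*)$", line.strip())
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    counts = {}
    for count, label in re.findall(r"(\d+)\s+(\w+);", match.group(2)):
        counts[label] = int(count)
    return int(match.group(1)), counts


length, counts = parse_sq("SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;")
print(length == sum(counts.values()))  # True: 609 + 314 + 355 + 581 + 0 = 1859
```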

3.4.18 The Sequence Data Line
The sequence data line has a line code consisting of two blanks. The sequence
is written 60 bases per line, in groups of 10 bases separated by a blank
character, beginning at position 6 of the line. The direction listed is
always 5' to 3', and wherever possible the non-coding strand
(homologous to the message) has been stored. Columns 73-80 of each
sequence line contain base numbers for easier reading and quick
location of regions of interest. The numbers are right justified and indicate
the number of the last base on each line.
An example of a data line is:
aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60

The characters used for the bases correspond to the IUPAC-IUB
Commission recommendations (see appendices).
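
A sequence data line can be reduced to its bases as in the following sketch,
which also uses the trailing base number as a running check. The function
names are hypothetical. The sketch splits on blanks instead of relying on
fixed column positions.

```python
def parse_sequence_line(line):
    """Return (bases, last_base_number) from one sequence data line."""
    fields = line.split()
    if len(fields) < 2 or not fields[-1].isdigit():
        raise ValueError("not a sequence data line: %r" % line)
    return "".join(fields[:-1]), int(fields[-1])


def read_sequence(lines):
    """Concatenate sequence data lines, checking each line's base number."""
    sequence = ""
    for line in lines:
        bases, last_base = parse_sequence_line(line)
        sequence += bases
        if len(sequence) != last_base:
            raise ValueError("base number %d does not match length %d"
                             % (last_base, len(sequence)))
    return sequence


line = "aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60"
print(len(read_sequence([line])))  # 60
```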

3.4.19 The CC Line
CC lines are free text comments about the entry, and may be used to convey
any sort of information thought to be useful that is unsuitable for
inclusion in other line types.

3.4.20 The XX Line
The XX (spacer) line contains no data or comments. Its purpose is to make
an entry easier to read on a page or terminal screen by setting off the
various types of information in appropriate groupings. XX is used
instead of blank lines to avoid confusion with the sequence data lines.
The XX lines can always be ignored by computer programs.

3.4.21 The // Line
The // (terminator) line also contains no data or comments. It designates
the end of an entry.
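
The roles of the XX and // lines can be illustrated with a small reader that
splits a flat file into entries. This is a sketch under the assumption that
the first two characters of every line are the line code. The name
iter_entries is hypothetical. Lines after the last terminator are dropped
because they do not form a complete entry.

```python
def iter_entries(lines):
    """Yield each entry as a list of (line_code, data) pairs, skipping XX spacers."""
    entry = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry  # the terminator closes the current entry
            entry = []
        elif line.startswith("XX") or not line.strip():
            continue  # spacer lines carry no data
        else:
            entry.append((line[:2], line[2:].strip()))


flat_file = ["ID AL009126; SV 2;", "XX", "AC AL009126;", "//", "ID X1; SV 1;", "//"]
for entry in iter_entries(flat_file):
    print(entry)
```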

APPENDIX A
STANDARD BASE CODES

These are the official IUPAC-IUB single-letter base codes (reference 1 below).

Code  Base Description
----  --------------------------------------------------------------
G     Guanine
A     Adenine
T     Thymine
C     Cytosine
R     Purine               (A or G)
Y     Pyrimidine           (C or T or U)
M     Amino                (A or C)
K     Ketone               (G or T)
S     Strong interaction   (C or G)
W     Weak interaction     (A or T)
H     Not-G                (A or C or T) H follows G in the alphabet
B     Not-A                (C or G or T) B follows A
V     Not-T (not-U)        (A or C or G) V follows U
D     Not-C                (A or G or T) D follows C
N     Any                  (A or C or G or T)
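
One way to use this table in software is to expand each ambiguity code into
the set of bases it stands for, for example when searching a sequence for a
motif. The sketch below is illustrative only. The names IUPAC and
iupac_to_regex are hypothetical, and the mapping is restricted to the DNA
letters A, C, G and T, so U is left out.

```python
import re

# Code -> bases it may stand for, transcribed from the table above (DNA only).
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Compile a motif written in IUPAC-IUB codes into a case-insensitive regex."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts), re.IGNORECASE)


motif = iupac_to_regex("GAYTC")
print(motif.pattern)                    # GA[CT]TC
print(bool(motif.search("aaagattcaa")))  # True
```

Matching is case-insensitive because sequence data lines are written in
lower case.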

APPENDIX B
MODIFIED BASE CODES

The following table is taken from Sprinzl M. and Gauss D.H.
(reference 2 below). The codes appear in database entries as values for the
/mod_base qualifier in the feature table.

Code      Modified Base
----      ------------------------------------------------------------
ac4c      4-acetylcytidine
chm5u     5-(carboxyhydroxylmethyl)uridine
cm        2'-O-methylcytidine
cmnm5s2u  5-carboxymethylaminomethyl-2-thiouridine
cmnm5u    5-carboxymethylaminomethyluridine
dhu       dihydrouridine
fm        2'-O-methylpseudouridine
gal q     beta-D-galactosylqueuosine
gm        2'-O-methylguanosine
i         inosine
i6a       N6-isopentenyladenosine
m1a       1-methyladenosine
m1f       1-methylpseudouridine
m1g       1-methylguanosine
m1i       1-methylinosine
m22g      2,2-dimethylguanosine
m2a       2-methyladenosine
m2g       2-methylguanosine
m3c       3-methylcytidine
m4c       N4-methylcytosine
m5c       5-methylcytidine
m6a       N6-methyladenosine
m7g       7-methylguanosine
mam5u     5-methylaminomethyluridine
mam5s2u   5-methylaminomethyl-2-thiouridine
man q     beta-D-mannosylqueuosine
mcm5s2u   5-methoxycarbonylmethyl-2-thiouridine
mcm5u     5-methoxycarbonylmethyluridine
mo5u      5-methoxyuridine
ms2i6a    2-methylthio-N6-isopentenyladenosine
ms2t6a    N-((9-beta-D-ribofuranosyl-2-methylthiopurin-6-yl)carbamoyl)threonine
mt6a      N-((9-beta-D-ribofuranosylpurine-6-yl)N-methyl-carbamoyl)threonine
mv        uridine-5-oxoacetic acid methylester
o5u       uridine-5-oxyacetic acid (v)
osyw      wybutoxosine
p         pseudouridine
q         queuosine
s2c       2-thiocytidine
s2t       5-methyl-2-thiouridine
s2u       2-thiouridine
s4u       4-thiouridine
m5u       5-methyluridine
t6a       N-((9-beta-D-ribofuranosylpurine-6-yl)carbamoyl)threonine
tm        2'-O-methyl-5-methyluridine
um        2'-O-methyluridine
yw        wybutosine
x         3-(3-amino-3-carboxypropyl)uridine, (acp3)u
OTHER     (requires /note= qualifier)


APPENDIX C
REFERENCES FOR ABBREVIATIONS AND SYMBOLS

1. Cornish-Bowden A., Nucl. Acids Res. 13:3021-3030(1985).
2. Sprinzl M., and Gauss D.H., "Compilation of tRNA Sequences",
Nucl. Acids Res. 10:r1-r55(1982).


Revised: 12-MARCH-2020