3 Data structures
3.1 Table relationships
This database contains several tables. The ERD for age (Figure 1) shows the physical data model structure [Also known as a database schema.] of the database and its entities (each entity is implemented as a database table) and relationships between these tables. Each table represents an object, event, or concept in the real world that has been represented in the database. Each attribute of a table is a defining property or quality of the table.
All of the table's attributes are shown in the ERD. The underlined attributes represent the table's primary key [A primary key is an attribute or a combination of attributes that contains an unqiue value to identify that record.]. This schema is valid regardless of the database system chosen, and it can remain correct even if the Database Management System (DBMS) is changed.
Note that Figure 1 shows the main tables only. Most of the tables in the age database have some attributes, called foreign keys [A foreign key is any attribute, or a combination of attributes, in a table that is a primary key of another table. Tables are linked together through foreign keys.], which contain standard NIWA fisheries codes, such as species and age_meth. These attributes provide links to tables in age and the rdb (research database) database, which contains the definitive list of standard codes. Therefore, an expanded ERD for these tables will follow (Figures 2 - 3).
Section 5 shows a listing of all the age tables as implemented by the Empress DBMS.
As can be seen in the listing of the tables, a table's primary key has an unique index on it. Primary keys are generally listed using the format:
Indices: |
UNIQUE index_name ON |
(attribute [, attributes ]) |
where the attribute(s) make up the primary key and the index name is the primary key name. Note that the typographical convention for the above (and subsequent) format is the square brackets [ ] may contain an item that is repeated zero or more times.
This unique index prevents records with duplicate key values from being inserted into the table, e.g., a new trip with an existing trip code, and hence ensures that every record can be uniquely identified.
The age database is implemented as a relational database. That is, each table is a special case of a mathematical construct known as a relation and hence elementary relation theory is used to deal with the data within tables and their relationships between them. All relationships in age are of the type one-to-many [A one-to-many relationship is where one record in a table (the parent) relates to one or many records in another table (the child).]. This is shown in the ERD by connecting a single line (indicating "many") from the child table (e.g., t_reading) to the parent table (e.g., t_fish) with an arrow-head (indicating "one") pointing to the parent.
Every relationship has a mandatory or optional aspect to it. That is, if a relationship is mandatory, then it has to occur and least once, while an optional relationship might not occur at all. For example, in Figure 1, consider that relationship between the table t_fish and it's child table t_reading. The symbol "O" by the child t_reading means that a fish can have zero or many age readings, while the bar by the parent t_fish means that for every age reading there must be a matching fish.
Most of these tables contain foreign keys, which link these tables to each other and to tables in the rdb database (Figure 3). The majority of these links are enforced by referential constraints [Also known as integrity checks ]. Constraints do not allow orphans to exist in any table, i.e., where a child record exists without a related parent record. This may happen when: a parent record is deleted; the parent record is altered so that the relationship is lost; or a child record is entered without a parent record. Constraints are shown in the table listings by the following format:
Referential: |
error message (attribute[, attribute]) INSERT|DELETE parent table (attribute[, attribute]) |
For example, consider the following constraint found in the table t_catalog:
Referential: |
invalid ageing material code (material) INSERT t_material (code) |
This means that the value of the attribute material of a record upon insert into t_catalog must already exist in the parent table t_material or the record will be rejected and an error message will be displayed.
A delete constraint implies that for a record to be deleted from a table, the values of the constrained attributes must not be equal to the values of the corresponding attributes in any record of the constraining table. This is used to prevent a parent record from being deleted while child records still exist.
All tables in this database are indexed. That is, attributes that are most likely to be used as a searching key have like values linked together to speed up searches. These indices are listed using the following format:
Indices: NORMAL (2, 15) index_name ON (attribute[, attribute])
Note that indices may be simple, pointing to one attribute or composite pointing to more than one attribute. The numbers "... (2, 15)..." in the syntax are Empress DBMS default values relating to the amount of space allocated for the index.
3.2 Database design
This ageing database can be split into several main areas, each with properties that are important to record:
Details about the fish. These details include biological measurements of the fish, e.g., sex, length, etc.
Details about the ageing material extracted from the fish.
The readings made on the ageing material to determine the age of the fish. One fish may have many items of ageing material.
The current location of the ageing material and its status.
The agreed age of the fish, based on different materials and reading methods.
These properties represent the main tables in this database.
At the top level we have the trip from which ageing material was collected. Ideally, ageing material should be catalogued down to an individual fish (ref. Table 4 - t_catalog). However, more often than not, ageing material from trip is stored in a location until such time that it may be required for research (this can sometimes never happen). Therefore, the location details and status of a collection of ageing material for each species collected from trips are held in the table t_trip (Table 1).
For those collections that are used, the details and biological measurements of the fish are held in the table t_fish (Table 2). This table has a composite primary key of trip_code, sample_no, sub_sample_no, and fish_no to identify uniquely each fish. Apart from the key, the sex and length of the fish are the most common data held in this table, although other information such as the weight of the fish and measurements of the otolith can be held also. Up to two types of ageing material can be taken from any one fish, these being recorded by the attributes material1 and material2. The expanded ERD (Figure 3) shows that this table contains three attributes that are link through referential constraints to master code tables. To aid in locating trips the attribute origin is included. This stores a 3-character code, which describes the origin of the fish.
The attributes material1 and material2 contain codes which identify which material was taken from the fish for ageing purposes, e.g., otoliths, scales, spines, etc. It is assumed that no more than type of ageing material is taken from any one fish.
A problem arises in the age database because the concept of the sub-sample (listed as sub_sample_no in age), is not used at all for the SOP, and only sometimes for the SMP. Because the possibility exists that it may be used, it must remain a part of the primary key for the four main tables (see Section 3.3 "Handling different sampling programmes" on page 10). This can result in the presence of null values as part of some primary keys, and by definition a primary key cannot contain null values. Without primary keys, this database implementation can suffer due to the possibility of allowing duplicate records to enter. To overcome this, all null values are replaced with the value of -1 for the attribute sub_sample_no. This allows primary keys to be constructed on all the tables.
The next table t_reading (Table 3) records the information for one reading of the material that was used to determine the age, such as the reading method used, who the reader was, and the results generated. Figure 3 shows the expanded ERD for t_reading. This table inherits the same primary key as t_fish to identify the fish, as well as the additional attribute reading_no to further identify individual age readings on the fish. There are four attributes of this table that are linked to master code tables (Figure 3). We have discussed the attributes origin and material in t_fish above. Two other codes are the attribute method, which identifies the preparation and reading methods employed, and the attribute reader, which identifies just who made the reading. The most common reading method is counting rings although one reading may also be a measurement such as length or otolith weight. Note that measurements such as otolith weight are also an attribute of t_fish. Although this data may be duplicated, the distinction is that if it has been used to determine an age, then it is a reading. If not, then it is just an attribute of the fish.
Usually, one reading generates one result with an error flag to denote the results accuracy. However, in some extreme cases, one reading may generate multiple results and error flags; e.g., a reading of an otoliths' radius can give the measurement to the year 1 annulus, year 2 annulus, and to the otolith edge. Ideally, in a 3NF data model this brings in another entity "result" which can be represented by a table t_result. The entity "result" will have an one-to-many relationship to the t_reading table. This table will store the t_reading primary key, a code to say what the result represents, and the value of the result. There will also be a table t_result_code to store the descriptions of the result codes. The table structures for these two new tables would look something like:
T_RESULT |
T_RESULT_CODES |
T_READING |
PKEY RESULT_CODE |
RESULT_CODE |
RESULT_DESCRIPTION |
RESULT_VALUE |
|
A decision was made not to implement this new entity, but rather to take the simpler approach to have multiple result and error attributes in t_reading. This is because each result and error is relevant only when compared to others to the same reading.
Therefore, queries for these data sets need return all the results and error flags for a reading in the same tuple. In addition, the performance hit of performing multi-table joins in a 3NF scenario would not be supported by the users.
This implementation of t_reading does mean that there is a problem of users knowing which result value got into which result field. This is addressed two ways:
The results returned are functionally dependent on the aging method as denoted by the method attribute in t_reading. Hence, a full description of the results return and which attributes they are stored in is given in the t_age_meth table and are listed in Appendix 1.
Result types are consistently stored in the same attribute for each given aging method.
Current location and status of the ageing material is held in the table t_catalog (Table 4). Again, this table inherits most of its primary key from t_fish, as well as the additional attribute material to further identify which piece of ageing material from the fish is being cataloged. Only two attributes of t_catalog are linked to master code tables (Figure 3), being origin and material. Details such as room number, and if necessary, filing cabinet number can be recorded, as well as the current status, e.g., "being read" or "missing", and the date the status was last updated. This table also has an one-to-many relationship with t_fish. Any one fish can produce several (although usually one) type of ageing material and each type can be stored in different locations or have a different status.
The agreed age of the fish is held in t_age (Table 5). An agreed age is the age in years that has been agreed upon as the age of the fish, based on the various readings made on the ageing material(s). The expanded ERD (Figure 3) shows that only tow attributes of t_age are linked to master code tables. These attributes, origin and method, occur through most of the main tables and have explained earlier. It is important to note that this table carries not only the age, but also the agreed result from which the age was based. A fish age is usually calculated from an algorithm applied to the agreed result. This algorithm can be recorded in the attribute comments.
The attribute method denotes the reading method used from which the age was agreed on. The agreed age can be based on: all methods, i.e., one age for the fish; or a single method, i.e., one fish has many ages based different ageing methods.
An important relationship to note is that between t_trip to other sampling databases (Figure 2) so that ageing data can be matched against spatial, temporal, environmental, and biological data. Currently, from t_trip, we can link to the tables t_trip from the trawl database, t_trip from the obs_lfs database, or t_landing from the market database. This relationship means that any one fish in the database can be linked through the attribute trip_code to the catch and landing records held in these other databases.
There are four other tables in this database that describe the various codes used: the codes used in describing the origin of the ageing material are listed in the table t_origin (Table 6); the various material used for ageing are listed in the table t_material (Table 7); the names of the readers are listed in the table t_reader (Table 8); and the different method used to prepare and read the ageing material are listed in the table t_age_meth (Table 9). They all have only two attributes - one for the code and another for a brief description of the codes.
3.3 Handling different sampling programmes
Given the variety of sources for the ageing material, the primary key for each fish involves many attributes. Each of the three sources (SOP, SMP, and Research Voyages) involves a fishing trip as their basic unit, but have their own format of identifying codes. To create a common code fishing trip, the following attributes are used: origin (the origin of the ageing material, a three character uppercase code for either the sampling programme, e.g., SOP, SMP or the vessel name, e.g., TAN, KAH ), yr (the year of the fishing trip), trip_code (fishing trip identifier), sample_no (tow number for SOP, stratum number for SMP or station number for Research Voyages), sub_sample_no (not used for the SOP, cluster number for SMP or sample number for Research Voyages). From this trip code the fish can be identified using the attributes species} and fish_no.
For example, consider the following identifiers:
Fishing trip identifier
921301 |
for the SMP |
kah9203 |
for Research voyages |
551 |
for the SOP |
This would be stored as:
Origin |
yr |
trip_code |
SMP KAH SOP |
1992 1992 1 992 |
921301 kah9203 551 |
Important: the attribute trip_code is a character field, unlike the trip_no and landing_no fields in the obs and market databases respectively. Therefore, all trip codes must have quotes around them when using an SQL SELECT statement. For example, to select ageing data from t_reading for red cod market sample number 920801, we would use:
select from t_reading where trip_code = "920801"...
A further problem arises because both the SMP and Research programmes use a twotier strategy to get to an individual fish, but the SOP only uses one. To make matters worse, each programme has their own nomenclature to describe each tier. To handle this, the age database has used generic attribute names. The following table shows that relationship between the generic attributes names of age to the attributes specific to each programme:
Ageing Database age |
SMP Database market |
SOP obs_lfs |
Database Research Trawl trawl |
trip_code sample_no sub_sample_no |
sample_no stratum_no cluster_no |
tripnumber townumber |
trip_code station_no subcatch_no |
Note that theoretically the sub_catch_no is null for certain types of records.
3.4 Null Values and sub_catch_no in the Primary Key
It is important to note that in instances where sub_sample_no should be null, the attribute will be assigned the value of "-" This value can be ignored; its only purpose is to allow for a primary key to implemented.