Data Addressing
The genome is very large, similar in capacity to a Blu-ray disc. Scientists have several ways to address this data. This article explains their way and our way.
Looking For Language
Our hunt here is for natural language encoded in the DNA of the human genome. At the end of the hunt we expect to find parts of the genome that encode language.
At the bottom of this hunt is a map from the codon table to the Paleo alphabet. This level of the problem is interesting too, but it is NOT our concern here.
Our concern here is how the immense amount of data in the human genome available in downloaded files can be mapped into codons. There are 12 different ways this can be done.
This creates a real practical problem. There are already over 3 gigabytes of data in the downloads. That data can be read 12 different ways. So there are over 36 gigabytes of data (3 gigabytes x 12 readings) that must be scanned in the process of finding language.
Until we know which way, or ways, were used for this encoding, we cannot trust any specific way. Scientists do not assume a single way either, and they have notation for indicating which information encoding system they are using at any given place in the genome.
In summary, there are 2 strands of DNA, each of which can be run in 2 different directions. Finally, the nucleotide codes (T, A, C and G) are bunched in sets of 3. This bunching can happen using any of 3 different reference frames.
In total there are 2 strands x 2 directions x 3 frames = 12 different systems that could hold encoded language.
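The count can be checked by enumerating the combinations directly. The following is only a small sketch; the names `STRANDS`, `DIRECTIONS`, `FRAMES` and `COMBINATIONS` are made up for this illustration and are not taken from our tools.

```python
from itertools import product

# Labels are illustrative; they follow the wording used in this article.
STRANDS = ("sense", "antisense")
DIRECTIONS = ("5' to 3'", "3' to 5'")
FRAMES = (0, 1, 2)

# Every (strand, direction, frame) combination, with the frame varying fastest.
COMBINATIONS = list(product(STRANDS, DIRECTIONS, FRAMES))

for index, (strand, direction, frame) in enumerate(COMBINATIONS):
    print(index, strand, direction, frame)

print(len(COMBINATIONS), "combinations in total")
```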
In order to make the underlying software simpler, we will invent a new term, steps, and assign step numbers 0 through 11 as markers for which of the 12 possible combinations is being used.
Our scanning code can then run all possible steps. Once we learn more about where language is encoded, we can then focus on only steps that are known to be interesting, saving computer time.
To begin, let me review the standard terms used in the genome for how these address systems actually work.
Strand
The famous DNA molecule is made up of 2 strands. Those strands are not exactly the same. One of the strands is called the sense strand. The other strand is called the antisense strand.
Data as downloaded from the internet is referenced against the sense strand.
Note that the antisense strand matches in pairs against the sense strand, but with different nucleotides. The following table shows the pairing, and thus the transformation that the sense data must go through to produce antisense data.
Nucleotide Pairing

Sense | Antisense |
---|---|
T | A |
A | T |
G | C |
C | G |
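
The pairing table can be expressed as a simple character translation. This is a minimal sketch, assuming the downloaded data is a plain string of the letters T, A, C and G; the name `complement` is our own for this illustration. Any other character (for example N, which some files use for unknown positions) is passed through unchanged here, which is an assumption and not a rule from the table.

```python
# Pairing table from above: T<->A and G<->C.
PAIRING = str.maketrans("TAGC", "ATCG")

def complement(sense: str) -> str:
    """Return the antisense letter for each sense letter, position by position."""
    return sense.upper().translate(PAIRING)

print(complement("TACG"))  # prints ATGC
```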
Direction
Each strand has direction. The two ends of each strand are chemically different. The start of a DNA strand is known as the 5' (five prime) end. This is because a phosphate group is attached to the 5' carbon of the sugar at that end.
The stop end of a DNA strand is called the 3' (three prime) end of the strand. This is because there is a hydroxyl group on the 3' carbon of the sugar at that end of the strand.
So DNA data as downloaded is given to us in the 5' to 3' direction.
BUT, it can also be referenced in the reverse direction, or the 3' to 5' direction. This is seen by biologists at times and is thought to be a genetic defect. Our concern is that the reverse direction might be used for language encoding.
As the 2 strands of DNA are matched to each other, the 5' or start of the sense strand matches to the 3' end of the antisense strand.
This means the normal reading direction is down the sense strand and then back up along the antisense strand.
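In code, direction comes down to reversing the data. The sketch below assumes the downloaded data is the sense strand in 5' to 3' order. Because the 5' end of the sense strand pairs with the 3' end of the antisense strand, pairing the letters in place gives the antisense strand in 3' to 5' order, so reading it 5' to 3' needs a reversal as well. The function names are hypothetical.

```python
# Pairing table: T<->A and G<->C.
PAIRING = str.maketrans("TAGC", "ATCG")

def sense_3_to_5(sense: str) -> str:
    """Sense strand read backwards, from its 3' end to its 5' end."""
    return sense[::-1]

def antisense_3_to_5(sense: str) -> str:
    """Antisense strand in the same left-to-right order as the sense data."""
    return sense.translate(PAIRING)

def antisense_5_to_3(sense: str) -> str:
    """Antisense strand read from its own 5' end (the reverse complement)."""
    return sense.translate(PAIRING)[::-1]

sense = "TTACGG"
print(sense_3_to_5(sense))      # prints GGCATT
print(antisense_3_to_5(sense))  # prints AATGCC
print(antisense_5_to_3(sense))  # prints CCGTAA
```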
Reference Frames
The DNA molecule encodes each amino acid using 3 nucleotides. Stated simply, the rungs of the DNA molecule are grouped into sets of 3.
The starting rung in the downloaded data is NOT always the start of the grouping. The natural grouping, starting with the first data in the file, is reference frame 0. This implies no offset from the natural data.
But, alternatively, reference frame 1 implies a 1 nucleotide offset.
Finally, reference frame 2 implies a 2 nucleotide offset.
Reference frame 3 does not exist because it is the same as reference frame 0. It folds back on the first frame.
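The three reference frames can be illustrated with a short sketch. The function name `codons` is made up for this example, and dropping the 1 or 2 leftover nucleotides at the end of the data is an assumption about how a partial group should be handled.

```python
def codons(data: str, frame: int) -> list[str]:
    """Group data into sets of 3, skipping `frame` nucleotides at the start."""
    if frame not in (0, 1, 2):
        raise ValueError("reference frame must be 0, 1 or 2")
    usable = data[frame:]
    whole = len(usable) - len(usable) % 3  # ignore an incomplete final group
    return [usable[i:i + 3] for i in range(0, whole, 3)]

data = "TACGGATCCA"
print(codons(data, 0))  # prints ['TAC', 'GGA', 'TCC']
print(codons(data, 1))  # prints ['ACG', 'GAT', 'CCA']
print(codons(data, 2))  # prints ['CGG', 'ATC']
```

The same 10 nucleotides give three different codon lists, which is why each frame has to be scanned separately.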
Steps, Our Notation
To simplify the code and make discussion of this problem easier, the 12 cases described above are each given a specific step number.
The following table shows the step codes and how they map to the cases above.
Step Code To Genome Map

Code | Strand | Direction | Frame |
---|---|---|---|
0 | sense | 5' to 3' | 0 |
1 | sense | 5' to 3' | 1 |
2 | sense | 5' to 3' | 2 |
3 | sense | 3' to 5' | 0 |
4 | sense | 3' to 5' | 1 |
5 | sense | 3' to 5' | 2 |
6 | antisense | 5' to 3' | 0 |
7 | antisense | 5' to 3' | 1 |
8 | antisense | 5' to 3' | 2 |
9 | antisense | 3' to 5' | 0 |
10 | antisense | 3' to 5' | 1 |
11 | antisense | 3' to 5' | 2 |
Data from this table will annotate results tables as needed across our software tools, tests and audits.
Notice that step 0 above involves no transformations. This is the most common way of looking at the data, and it is why the downloaded files are encoded as step 0 data.
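Because the step codes run in table order, a step number can be decoded with two integer divisions and then applied to the data. This is only a sketch and not our production code: the names `decode_step` and `read_step` are hypothetical, the data is assumed to be the sense strand in 5' to 3' order, and the frame offset is assumed to count from the start of whichever reading the step selects.

```python
# Pairing table: T<->A and G<->C.
PAIRING = str.maketrans("TAGC", "ATCG")

def decode_step(step: int) -> tuple[str, str, int]:
    """Map a step code 0..11 to (strand, direction, frame) as in the table."""
    if not 0 <= step <= 11:
        raise ValueError("step must be between 0 and 11")
    strand_index, rest = divmod(step, 6)      # 6 steps per strand
    direction_index, frame = divmod(rest, 3)  # 3 frames per direction
    strand = ("sense", "antisense")[strand_index]
    direction = ("5' to 3'", "3' to 5'")[direction_index]
    return strand, direction, frame

def read_step(data: str, step: int) -> list[str]:
    """Return the codons seen when the downloaded data is read as `step`."""
    strand, direction, frame = decode_step(step)
    if strand == "sense":
        seq = data if direction == "5' to 3'" else data[::-1]
    else:
        paired = data.translate(PAIRING)  # antisense, in 3' to 5' order
        seq = paired[::-1] if direction == "5' to 3'" else paired
    usable = seq[frame:]
    whole = len(usable) - len(usable) % 3  # drop an incomplete final group
    return [usable[i:i + 3] for i in range(0, whole, 3)]

for step in range(12):
    print(step, decode_step(step), read_step("TACGGA", step))
```

Step 0 returns the data grouped exactly as downloaded, which matches the note above that step 0 involves no transformations.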