Introducing CHM13
The reference human genome sequence we use in this project is named CHM13, and we work with version 2.0. This article introduces it and provides links to download sources.
Earlier Sequences
The first publicly funded human genome sequencing project was completed in 2003. That work produced a reference human genome. Reference genomes are used in various workflows involving the study of the genome.
Despite the project being declared complete, that sequence was never fully finished. Here is an article on the point.
Completing the human genome sequence (genome.gov)
As that article explains, the equipment available in 2003 could only read about 5000 base pairs at a time. Software then had to put those pieces back in order. This works unless the genome has highly repetitive sections, which it does, because a short piece taken from inside a repeat could belong in more than one place. Modern equipment can read about 100,000 base pairs at a time, so highly repetitive areas can now be sequenced. Even with new equipment, the last 8 percent took twice as long as the first 92 percent. It took a dedicated team to do it.
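The effect of repeats on short reads can be illustrated with a minimal sketch. The toy sequence, the read strings, and the helper name `find_all` are all made up for illustration; real assemblers work very differently and at a vastly larger scale.

```python
def find_all(sequence: str, read: str) -> list[int]:
    """Return every start position where `read` occurs in `sequence`."""
    positions = []
    start = sequence.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = sequence.find(read, start + 1)
    return positions


# Toy sequence (an assumption for illustration): unique flanks
# around three copies of a short repeat unit.
REPEAT = "ACGT"
GENOME = "TTGCA" + REPEAT * 3 + "GGATC"

short_read = "ACGTACGT"                  # lies entirely inside the repeat
long_read = "GCA" + REPEAT * 3 + "GGA"   # spans the repeat into both flanks

print(find_all(GENOME, short_read))  # more than one candidate position
print(find_all(GENOME, long_read))   # a single, unambiguous position
```

The short read fits in two places, so software cannot tell where it belongs. The long read reaches the unique sequence on both sides of the repeat, so it fits in only one place.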
Use of a Reference Genome
It is usually easier to study a genome by how it differs from a reference genome than to study it from scratch. This is because a considerable body of knowledge has been built up against reference genomes. Studying a specific genome therefore takes less work and cost when the study focuses on how it differs from a known reference.
Many of the standard software tools for dealing with genomes expect a reference genome as input. Those tools then take new sequence data from a sequencing machine and compare it to the reference.
Indeed, sequencing machines themselves break the DNA into small fragments to make the process go faster. Their output usually requires a reference genome to put those fragments back together in order.
Once through this process, a scientist who wants to study a novel sample can do further study, especially by looking at how the sample differs from the reference. Considering the scale of the human genome, roughly 3 billion base pairs, this is usually the only practical way to attack problems.
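As a rough sketch of what studying a sample "by difference" means, the following compares a sample to a reference position by position. It assumes the sample has already been aligned to the reference and has the same length, so it only reports single-base substitutions. The function name `differences` and the toy sequences are hypothetical; real tools also handle insertions, deletions, and read quality.

```python
def differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, sample_base) wherever the two differ.

    Assumes both sequences are already aligned and of equal length,
    so only single-base substitutions are reported.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Toy data, made up for illustration.
reference = "GATTACAGATTACA"
sample = "GATTACTGATCACA"

for pos, ref_base, sample_base in differences(reference, sample):
    print(f"position {pos}: {ref_base} -> {sample_base}")
```

The result is a short list of differences rather than the whole sequence, which is why working against a reference keeps the problem manageable.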
Here is a link to an article that explains some of the reference genomes, their names, and their differences, from hg16 through to GRCh38, which was still in use going into 2022.
Get to know your reference genome (Bite Size Bio)
So the reference genome produced in 2003 was the starting point for much further work. By 2009 a reference called GRCh37 had become available, and GRCh38 followed in 2013. Patch level 14 of GRCh38 was made available in 2022.
This family of sequences was never completely finished. In 2003 only 92 percent of the human genome was finished. The problem was the complexity of the remaining 8 percent. It was going to take time and new techniques to sequence those remaining parts.
By January of 2022, a new reference genome had finally been sequenced multiple times, from scratch, and made available to the public. It is named CHM13. Here is one of the places on the internet that holds that reference genome, a page on GitHub.
CHM13 (GitHub)
That page has quite a few resources for understanding the human genome. The first section is a series of articles about this reference genome, and I would recommend the top linked article there, from Science.
The complete sequence of the human genome (Science)
This article explains how this reference genome is better than those that have been used since 2003. I won't repeat details here.
Other Resources
Scrolling down on that GitHub page, there is a section called Assembly Sequences. In that section is a link to the file chm13v2.0.fa.gz. This is the starting file for our work to look for inspired Bible fragments in the human genome.
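Once the file is downloaded, a quick sanity check is to list the sequences it contains and their lengths. The sketch below uses only the Python standard library and assumes the file is an ordinary gzip-compressed FASTA file (header lines starting with `>`, followed by sequence lines) saved in the current directory. The function name `fasta_lengths` and the path are assumptions for illustration, not part of the CHM13 distribution.

```python
import gzip
import os
from typing import Iterator


def fasta_lengths(path: str) -> Iterator[tuple[str, int]]:
    """Yield (record_name, sequence_length) for each record in a gzipped FASTA."""
    name = None
    length = 0
    # "rt" opens the compressed file as text and decompresses line by line,
    # so the whole genome is never held in memory at once.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length
                # Keep only the first word of the header as the record name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                length = 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length


# Assumed location of the downloaded file; adjust as needed.
FASTA_PATH = "chm13v2.0.fa.gz"
if os.path.exists(FASTA_PATH):
    for name, length in fasta_lengths(FASTA_PATH):
        print(name, length)
```

Reading the file as a stream keeps memory use small, which matters for a file holding billions of base pairs.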