In order to provide stability and ensure that old analyses can be reproduced, both genome browsers make available not only the current version of the genome assemblies but older ones as well. In addition, annotation tracks, such as the GENCODE gene track and the SNP track, may be based on different versions of the underlying data. Thus, users are encouraged to verify the version of all data (both genome assembly and annotations) when comparing a region of interest between the UCSC and Ensembl Genome Browsers.
This chapter presents general guidelines for accessing the genome sequence and annotations using the UCSC and Ensembl Genome Browsers. Although similar analyses could be carried out with either browser, we have chosen to use different examples at the two sites to illustrate different types of questions that a researcher might want to ask. We finish with a short description of JBrowse (Buels et al. 2016), another web-based genome browser that users can set up on their own servers to share custom genome assemblies and annotations. All of the resources discussed in this chapter are freely available.
After starting in 2000 with just a display of an early draft of the human genome assembly, the UCSC Genome Browser now provides access to assemblies and annotations from over 100 organisms (Haeussler et al. 2019). The majority of assemblies are of mammalian genomes, but other vertebrates, insects, nematodes, deuterostomes, and the Ebola virus are also included. The assemblies from some organisms, including human and mouse, are available in multiple versions. New organisms and assembly versions are added regularly.
The UCSC Browser presents genomic annotation in the form of tracks. Each track provides a different type of feature, from genes to SNPs to predicted gene regulatory regions to expression data. Each organism has its own set of tracks, some created by the UCSC Genome Bioinformatics team and others provided by members of the bioinformatics community. Over 200 tracks are available for the GRCh37 version of the human genome assembly. The newer human genome assembly, GRCh38, has fewer tracks, as not all the data have been remapped from the older assembly. Other genomes are not as well annotated as human; for example, fewer than 20 tracks are available for the sea hare. Some tracks, such as those created from NCBI transcript data, are updated weekly, while others, such as the SNP tracks created from NCBI variant data (Sayers et al. 2019), are updated less frequently, depending on the release schedule of the underlying data. For ease of use, tracks are organized into subsections. For example, depending on the organism, the Genes and Gene Predictions section may include evidence-based gene predictions, ab initio gene predictions, and/or alignment of protein sequences from other species.
The home page of the UCSC Genome Browser provides a stepping-off point for many of the resources developed by the Genome Bioinformatics group at UCSC, including the Genome Browser, BLAT, and the Table Browser, which will be described in detail later in this chapter. The Tools menu provides a link to liftOver, a widely used tool that converts genomic coordinates from one assembly to another. Using this tool, it is possible to update annotation files so that old data can be integrated into a new genome assembly. The Download menu provides an option to download all the sequence and annotation data for each genome assembly hosted by UCSC, as well as some of the source code. The What's New section provides updates on new genome assemblies, as well as new tools and features. Finally, there is an extensive Help menu, with detailed documentation as well as videos. Users may also submit questions to a mailing list, and most queries are answered within a day.
The UCSC Genome Browser provides multiple ways for both individual users and larger genome centers to share data with collaborators or even the entire bioinformatics community. These sharing options are available on the My Data link on the home page. Custom Tracks allow users to display their own data as a separate annotation track in the browser. User data must be formatted in a standard data structure in order to be interpreted correctly by the browser. Many commonly used file formats are supported, including Browser Extensible Data (BED), Binary Alignment/Map (BAM), and Variant Call Format (VCF; Box 4.1). Small data files can be uploaded or pasted into the Genome Browser for personal use. Larger files must be saved on the user's web server and accessed by URL through the Genome Browser. As anyone with the URL can access the data, this method can be used to share data with collaborators. Alternatively, Custom Tracks, along with track configurations and settings, can be shared with selected collaborators using a named Session. Some groups choose to make their Sessions available to the world at large in My Data → Public Sessions. Finally, groups with very large datasets can host their data in the form of a Track Hub so that it can be viewed on the UCSC Genome Browser. When a Track Hub is paired with an Assembly Hub, it can be used to create a browser for a genome assembly not already hosted by UCSC.
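As a rough illustration of sharing a remotely hosted Custom Track by URL, the following Python sketch assembles a Genome Browser link. It assumes the `db`, `position`, and `hgt.customText` URL parameters described in the UCSC custom track help pages; the helper name `custom_track_url`, the example.org address, and the coordinates are hypothetical, and the parameters should be checked against the current UCSC documentation before use.

```python
from urllib.parse import urlencode

# Base address of the Genome Browser track display.
UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def custom_track_url(db, position, track_url):
    """Return a browser link that loads a custom track file hosted elsewhere.

    db        -- UCSC assembly identifier (for example, hg38 for GRCh38)
    position  -- region to display, written as chrom:start-end
    track_url -- address of the track file on the user's own web server
    """
    params = {
        "db": db,
        "position": position,
        "hgt.customText": track_url,  # assumed parameter; see UCSC help pages
    }
    # urlencode percent-escapes ":" and "/" so the link survives copy and paste.
    return f"{UCSC_HGTRACKS}?{urlencode(params)}"


# Hypothetical file location and region, for illustration only.
link = custom_track_url(
    "hg38", "chr1:1000-2000", "https://example.org/tracks/my_peaks.bed"
)
print(link)
```

Anyone who receives such a link can view the track, which is why this route suits data that may be shared freely; a named Session is the better fit when access should be limited to selected collaborators.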
Box 4.1 Common File Types for Genomic Data
Both the UCSC and Ensembl Genome Browsers allow users to upload their own data so that they can be viewed in context with other genome-scale data. User data must be formatted in a commonly used data structure in order to be interpreted correctly by the browser.
Browser Extensible Data (BED) format is a tab-delimited format that is flexible enough to display many types of data. It can be used to display fairly simple features, like the location of transcription factor binding sites, as well as more complex ones, like transcripts and their exons.
Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format. It is a compact format designed for use with very large files of nucleotide sequence alignments. Because it can be indexed, only the portion of the file that is needed for display is transferred to the browser. Many tools for next-generation sequence analysis use BAM format as output or input.
Variant Call Format (VCF) is a flexible format for large files of variation data including single-nucleotide variants, insertions/deletions, copy number variants, and structural variants. Like BAM format, it is compressed and indexed, and only the portion of the file that is needed for display is transferred to the browser. Many tools for variant analysis use VCF format as output or input.
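To make the tab-delimited layout of BED concrete, here is a minimal Python sketch that writes and reads the first six BED columns (chromosome, start, end, name, score, strand). It relies on the standard BED convention that the start coordinate is 0-based and the end coordinate is exclusive. The names `BedFeature`, `to_bed_line`, and `parse_bed_line`, and the example feature, are hypothetical and serve only as an illustration; this is not a full BED parser.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """The first six BED columns; only the first three are required."""
    chrom: str
    start: int          # 0-based start coordinate
    end: int            # end coordinate, exclusive
    name: str = "."
    score: int = 0
    strand: str = "."


def to_bed_line(feature):
    """Format a feature as one tab-delimited BED line (no trailing newline)."""
    if feature.start < 0 or feature.end < feature.start:
        raise ValueError(f"invalid interval: {feature.start}-{feature.end}")
    return "\t".join(str(value) for value in feature)


def parse_bed_line(line):
    """Parse one BED line, filling defaults for missing optional columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    return BedFeature(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3] if len(fields) > 3 else ".",
        score=int(fields[4]) if len(fields) > 4 else 0,
        strand=fields[5] if len(fields) > 5 else ".",
    )


# A made-up binding site, for illustration only.
site = BedFeature("chr1", 999, 1010, "siteA", 500, "+")
line = to_bed_line(site)
print(line)
print(parse_bed_line(line).end - parse_bed_line(line).start)  # feature length
```

Because the end coordinate is exclusive, the length of a feature is simply end minus start, which is one reason the convention is convenient for interval arithmetic.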
The UCSC Genome Browser home page lists commonly accessed tools, as well as a frequently updated news section that highlights major data and software updates. To reach the Genome Browser Gateway, the main entry point for text-based searches, click on the Gateway link on the home page (Figure 4.1). The default assembly is the most recent human assembly, GRCh38, from December 2013. The genomes of other species can be selected from the phylogenetic tree on the left side of the Gateway page, or by typing their name in the selection box. On the human Gateway page, there is also the option to select one of four older human genome assemblies. Details about the GRCh38 assembly and instructions for searching are available on the Gateway page.
To perform a search, enter text into the Position/Search Term box. If the query maps to a unique position in the genome, such as a search for a particular chromosome and position, the Go button links directly to the Genome Browser. However, if there is more than one hit for the query, such as a search for the term metalloprotease, the resulting page will contain a list of results that all contain that term. For some species, the terms have been indexed, and typing a gene symbol into the search box will bring up a list of possible matches. In this example, we will search for the human hypoxia inducible factor 1 alpha subunit (HIF1A) gene (Figure 4.1), which produces a single hit on GRCh38.
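The distinction between a coordinate query and a free-text query can be sketched in a few lines of Python. The example below is not the browser's own search logic; it only shows how a position string of the form chrom:start-end might be recognized and converted to a BED-style interval. It assumes that position strings are 1-based and inclusive, while BED intervals are 0-based with an exclusive end. The function names and the coordinates are hypothetical and do not represent the actual location of HIF1A.

```python
import re

# chrom:start-end, where the numbers may contain thousands separators.
_POSITION = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<start>\d[\d,]*)-(?P<end>\d[\d,]*)$"
)


def parse_position(query):
    """Return (chrom, start, end) for a coordinate query, else None.

    None signals that the query should be treated as a free-text term,
    such as a gene symbol or a keyword.
    """
    match = _POSITION.match(query.strip())
    if match is None:
        return None
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid range in query: {query!r}")
    return match["chrom"], start, end


def to_bed_interval(chrom, start, end):
    """Convert a 1-based inclusive range to a 0-based, end-exclusive one."""
    return chrom, start - 1, end


# Illustrative queries only.
for query in ("chr14:61,000,000-61,050,000", "HIF1A", "metalloprotease"):
    print(query, "->", parse_position(query))
```

A text term such as a gene symbol falls through to the None branch, mirroring the case in which the browser must look the term up and may return a list of matches instead of a single region.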