The GENCODE annotation is our primary resource for defining gene, transcript and exon boundaries in the human and mouse genomes.
Internally, we keep a network share for flat files so that every member of our group works on the same set of files. Since GENCODE releases new annotations
about twice a year, the share has to be kept up to date as well. This post shows how we minimized the work needed to switch to a new GENCODE release.
GENCODE annotations come in various data files, available in both GTF and GFF3 format.
The main files that we use are the basic gene annotation and the long non-coding RNA annotation. However, some of our pipelines need subsets of these files, e.g. containing only
exons, or files in a different data format (e.g. BED). Both can be produced with simple Linux commands and/or awk. To minimize the steps needed to recreate these files for every new release,
we use the GNU make utility. Makefiles are like recipes: they contain information on how to create one or more files from a number of other files.
For example, to create a BED6 file from a GTF file, I would run awk on the GTF to extract the necessary information.
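As a rough illustration of such a command (a sketch, not necessarily the exact one we ran), the shell function below prints one BED6 line per gene record. The function name gtf_to_bed6, the restriction to gene records, and the use of gene_id as the BED name are assumptions made for this example.

```shell
# Hypothetical helper: convert the gene records of a GTF file to BED6
# (chrom, start, end, name, score, strand) on standard output.
gtf_to_bed6() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                  # skip the GTF header comments
         $3 == "gene" {
             name = "."
             # take the value of the gene_id attribute, without its quotes
             if (match($9, /gene_id "[^"]+"/)) {
                 name = substr($9, RSTART + 9, RLENGTH - 10)
             }
             # GTF starts are 1-based, BED starts are 0-based
             print $1, $4 - 1, $5, name, 0, $7
         }' "$1"
}
```

It could be called as `gtf_to_bed6 gencode.vM6.basic.annotation.gtf > genes.bed`, where genes.bed is just a placeholder output name. The end coordinate is left unchanged because GTF ends are inclusive and BED ends are exclusive.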
Similarly, we collected a set of command-line file transformations in a Makefile to download, extract, filter and reformat files from any GENCODE release.
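A minimal sketch of what such a Makefile could look like is given below, written as a shell here-document so that it can be pasted into a terminal; the lines between the two EOF markers are the Makefile itself. The download URL layout, the variable names SPECIES, RELEASE, BASE_URL and PREFIX, and the choice of output files are assumptions for illustration, so check them against the GENCODE download page before relying on them.

```shell
cat > Makefile <<'EOF'
# Sketch only: URL layout and file names are assumptions, verify before use.
# Recipes follow a ';' on the rule line, so no tab characters are required.
SPECIES  := mouse
RELEASE  := M6
BASE_URL := https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_$(SPECIES)/release_$(RELEASE)
PREFIX   := gencode.v$(RELEASE)

# Files to build by default.
all: $(PREFIX).basic.annotation.gtf $(PREFIX).basic.annotation.exons.bed $(PREFIX).long_noncoding_RNAs.gtf

# Download: fetch a compressed annotation file from the release directory.
$(PREFIX).%.gtf.gz: ; wget -q -O $@ $(BASE_URL)/$@

# Extract: decompress without removing the archive.
%.gtf: %.gtf.gz ; gunzip -c $< > $@

# Filter and reformat: keep exon records and print them as BED6 (0-based start).
%.exons.bed: %.gtf ; awk 'BEGIN { FS = OFS = "\t" } $$3 == "exon" { print $$1, $$4 - 1, $$5, ".", 0, $$7 }' $< > $@

# Remove half-written targets when a recipe fails, and keep downloaded archives.
.DELETE_ON_ERROR:
.SECONDARY:
.PHONY: all
EOF
```

Because the release is an ordinary make variable in this sketch, it can be overridden on the command line (for example `make RELEASE=M7`) instead of being edited in the file, and `make -n` prints the commands without running them.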
If you have the make utility installed and save the above code in a file called Makefile, a simple call to make
will download, extract and reformat some files from the current GENCODE release, vM6. The Makefile can easily be adjusted to work for human releases, too!