Learning Circos

Bioinformatics and Genome Analysis

Institut Pasteur Tunis Tunisia, December 10 – December 11, 2018

Download course materials
v2.00 6 Dec 2018 Download PDF slides
v2.00 6 Dec 2018

A 2- or 4-day practical mini-course in Circos, command-line parsing and scripting. This material is part of the Bioinformatics and Genome Analysis course held at the Institut Pasteur Tunis.

Quick links

menu day 1 day 2 day 3 day 4 scripts references updates

sessions / day.3

Visualizing and Parsing clustal alignments

Additional material — Day 3

9h00 - 10h30 | Lecture 1 — Visualization strategies

11h00 - 12h30 | Lecture (practical) 2 — Parsing clustal alignments on command line

14h00 - 15h30 | Lecture (practical) 3 — Parsing clustal alignments with Perl

16h00 - 18h00 | Lecture (practical) 4 — Plotting clustal alignments

Concepts covered

designing effective visualizations, parsing clustal alignments, advanced Perl techniques, using Circos tools

menu lecture 1 lecture 2 lecture 3 lecture 4

sessions / day.3 / lecture.3

Parsing clustal alignments with Perl

sessions / day.3 / lecture.3 / README

Let's get into somewhat more advanced Perl.

sessions / day.3 / lecture.3 / 1 / README

Let's get started writing a Perl script to parse the clustal alignments.

Although you can already use clustal2link, often you'll need to parse your own files. In the previous lecture you've seen how you can do this on the command line.

Now, using the boilerplate Perl script in this directory clustalparser you will build up a more functional parser.

The script already has some functionality. It supports usage and man page flags (you'll need to change the output of these in the script to reflect the functionality you'll be adding)

>clustalparser -h
>clustalparser -man

and already parse the file into a list of hashes and dumps the file structure to STDOUT.

>cat ../../data/alignment.aln | ./clustalparser

# or

>./clustalparser -in ../../data/alignment.aln

$VAR1 = [
          {
            'cumul' => '',
            'id' => 'hsaA',
            'seq' => '------------------------------------------------------------'
          },
          {
            'cumul' => '',
            'id' => 'dreA',
            'seq' => '------------------------------------------------------------'
         },
          {
            'cumul' => '',
            'id' => 'rnoA',
            'seq' => '------------------------------------------------------------'
          },
         ...

This data structure is probably not what you want — it's in the script to serve as an example and starting point. If you need a quick refresher in Perl data structures, see Day 2 Lecture 1.

Script specifications

Take a look at the original alignment file in ../../data/alignment.aln and then take a look at what clustal2link created in Lecture 2.

# ../../lecture.2/1/data/links.txt
celA 1 9 hsa8K 581 589
celA 10 16 hsa8K 603 609
celA 17 21 hsa8K 659 663
celA 22 28 hsa8K 781 787
celA 29 31 hsa8K 811 813
...

# ../../lecture.2/1/data/karyotype.txt
chr - celA cela 0 263 cela
chr - dmeA dmea 0 979 dmea
chr - dreA drea 0 421 drea
chr - dreB dreb 0 3247 dreb
chr - ggaA ggaa 0 312 ggaa
chr - ggaB ggab 0 466 ggab
chr - hsa8K hsa8k 0 8268 hsa8k
chr - hsaA hsaa 0 2384 hsaa
chr - hsaD hsad 0 426 hsad
chr - rnoA rnoa 0 2352 rnoa
chr - rnoB rnob 0 1221 rnob

Your first version of the script should generate these two files. You should remind yourself that although the karyotype file identifies each entry as chr, this is just an arbitrary string and the entry defines an axis on which data are drawn. This axis could be anything: a chromosome, a contig, a scaffold, or just a number line.

The fact that the karyotype file starts with chr - is for historical (and unimportant) reasons.

The last field in the karyotype file is the color, which is assigned a name based on the sequence id. These colors need to be defined, though, and you'll see that in

#../../lecture.2/1/etc/seq.colors.conf
hsa8k = black
hsaa  = paired-12-qual-1
hsad  = paired-12-qual-2
rnoa  = paired-12-qual-3
rnob  = paired-12-qual-4
ggaa  = paired-12-qual-5
ggab  = paired-12-qual-6
drea  = paired-12-qual-7
dreb  = paired-12-qual-8
dmea  = paired-12-qual-10
cela  = paired-12-qual-12

where paired-12-qual-N are the colors from the 12-color qualitative paired Brewer palette. Recall that in Day 1 Lecture 3 we saw how Brewer palettes and colors were defined in Circos.

The colors aren't that important for now and you can just assign each sequence ideogram to be black.

One way to parse the data is to assume that each of the 11 sequences has the same length as the total length of the alignment. This doesn't have to be the case since the sequences can also be shorter.

This approach allows you to have a start coordinate of each sequence located at the start of the alignment. You basically have 11 strings (or in general n strings) and you want to find the positions at which the strings have a base, which indicates an alignment.

There are many ways to do this and to implement this. Below, I suggest a simple method. You can choose your own approach, if you like.

Step 1. Concatenate the sequence lines

Parse out the sequence lines for each entry and concatenate them together. This will give you a string of length 8,268 for each sequence id.

You can store this string in a hash, keyed by the sequence id.

Step 2. Iterate over each string position and check which sequence string has a base.

You can access the character (or any string of $len) at position $i in the string with

 substr($string,$i,$len)

So a loop like

 for $i (0..length($string)-1) {
    my $base = substr($string,$i,1);
   ...
 }

loops over characters in the $string at each position.

You'll want to wrap this loop in an outer loop that iterates over the sequences, so that you're accessing each sequence $string one at a time.

Assume that hsa8K is the reference, so each time you see a base in hsa8K and in another sequence then you'll need to add that position to the alignment between the two. For example, if had

       1234567
 seq1  AA-CCCG 5
 seq2  -T--CCG 5

then the alignments would be at position 2 and 5-7 and the link lines for this might look like

 seq1 2 2 seq2 2 2
 seq1 5 7 seq2 5 7

An easy way to get these start and end positions of alignments is to make a hash of all the positions at which you see a base for each sequence. This isn't very memory efficient or fast, but it doesn't matter for us since the sequences are so short.

If you see a base at position $i for sequence $id, then you would set

 $alignment->{$id}{$base} = 1

You can then compile a list of base positions that are letters in any pair of sequences, $id1 and $id2

 my @idx_id1 = keys %{$alignment->{$id}};
 my @idx_id1_id2 = grep($alignment->{$id2}{$_} @idx_idx1);

Or if you're feeling very Perlish, do it on one line

 @idx_id1_id2 = grep($alignment->{$id2}{$_}, keys %{$alignment->{$id}});

Now that you have this list, you need to turn it into a series of ranges. This could be a list of [min,max] pairs of each alignment. For (2,5,6,7,...) you'd want ([2,2],[5,7],...).

I've helped you here by writing a list2ranges() function in the script already. You pass a list and it returns a list of ranges.

 my @x = qw(0 1 2 3 4 5 7 8 9 10 11 12 13 14 16 18 19);
 my $ranges = list2ranges(@x);

and $ranges is the list of lists

 [ [ 0,5 ], [ 6,13 ], [ 14,14 ], [ 15,16 ] ]

For example list items with index 6..13 (0-indexed) are the contiguous series (7,8,9,10,11,12,13,14).

You would iterate over all pairs of sequences, given a reference

 my $ref = "hsa8K";
 for my $id (keys %$alignment) {
   next if $id eq $ref;
   my @idx = grep($alignment->{$id}{$_}, keys %{$alignment->{$ref}});
   my $ranges = list2ranges(@idx)

   # generate link file between hsa8K and sequence $id
   for $range (@$ranges) {
    my ($min,$max) = @$range;
    # appropriate print statement here
   }
 }

sessions / day.3 / lecture.3 / 1 / clustalparser

↓

# This is a boilerplate Perl script for you to edit to build up a clustal parser
#
# Run it first as
#
# ./clustalpaser -h
#
# and then using the alignment file
#
# cat ../../data/alignment.aln | ./clustalparser

my $data = read_file($CONF{in});

# dump the data structure we created from the
# alignments to STDOUT

printdumper($data);

# Continue with the script here

# ...


# Convert a list of numbers into a
# list of contiguous ranges defined by [min,max] 
#
# [ [min1,max1], [min2,max2], ...]
sub list2ranges {
  my @x = sort {$a <=> $b} @_;
  my $ranges = [];
  # I wrote two approaches to solve this problem. The first
  # is more like C code and the second is more Perlish.
  # Change the 1 to a 0 in the outer if loop to use the second approach.
  # Which one reads easier?
  if(1) {
    # This one is more like C
    for ( my $i = 0; $i < @x; $i++) {
      my $min = $i;
      for my $j ($i+1..@x-1) {
        if($x[$j] - $x[$j-1] > 1) {
          last;
        }
        $i = $j;
      }
      push @$ranges, [$min,$i];
    }
  } else {
    # This one is more Perlish
    my $i = 0;
    my $max;
    while(@x) {
      my $x = shift @x;
      if(! defined $ranges->[-1] || $x - $max > 1) {
        push @$ranges, [$i,$i];
      } else {
        $ranges->[-1][1] = $i;
      }
      $max = $x;
      $i++;
    }
  }
  return $ranges;
}

# Read the alignment file. For now, all it does
# is read the alignment lines and store them
# as a hash, keyed by the sequence id.
sub read_file {
  my $file        = shift;
  my $inputhandle = get_handle($file);
  my $data;
  while(<$inputhandle>) {
    # skip the first line, which is a header
    # this is one of those rare times that a special variable
    # other than $_ or @_ is usedful. The $. gives the
    # line number of the file handle that is currently being read.
    next if $. == 1;
    # skip comments and blank lines
    next if /^\s*\#/ ||/^\s*$/;
    # trim newline
    chomp;
    # you now have the fields of this line
    my ($id,$seq,$cumul) = split;
    # put these values into a hash and push it onto the list
    push @$data, {id=>$id,
                  seq=>$seq,
                  cumul=>$cumul};
                  
  }
  return $data;
}