Ruby pack unpack

 
 

The C programming language lets developers directly access the memory where variables are stored. Ruby does not allow that. Still, there are times while working in Ruby when you need to get at the underlying bits and bytes. Ruby provides two methods, pack and unpack, for that.

Here is an example.

> 'A'.unpack('b*')
=> ["10000010"]

In the above case 'A' is a string, and I am using unpack to read its bits. The ASCII table says that the ASCII value of 'A' is 65, and the binary representation of 65 is 01000001. The b* directive lists those bits starting from the least significant one, which is why the result reads 10000010.

Here is another example.

> 'A'.unpack('B*')
=> ["01000001"]

Notice the difference in the result from the first case. What's the difference between b* and B*? In order to understand the difference, let's first discuss MSB and LSB.

Most significant bit vs Least significant bit

Not all bits are created equal. The character C has an ASCII value of 67. The binary value of 67 is 1000011.

First let's discuss the MSB (most significant bit) style. In MSB style, reading from left to right (and you always read from left to right), the most significant bit comes first. Because the most significant bit comes first, we can pad an additional zero on the left to bring the number of bits to eight without changing the value. After adding that zero the binary value looks like 01000011.

If we want to convert this value to the LSB (least significant bit) style, then we need to store the least significant bit first, going from left to right. Given below is how the bits move when converting from MSB to LSB. Note that in the steps below position 1 refers to the leftmost bit.

move value 1 from position 8 of MSB to position 1 of LSB
move value 1 from position 7 of MSB to position 2 of LSB
move value 0 from position 6 of MSB to position 3 of LSB
and so on and so forth

After the exercise is over the value will look like 11000010.

We did this exercise manually to understand the difference between most significant bit and least significant bit. However, the unpack method can directly give the result in both MSB and LSB style. It accepts both b* and B* as input. As per the Ruby documentation, here is the difference.

B | bit string (MSB first)
b | bit string (LSB first)

Now let's take a look at two examples.

> 'C'.unpack('b*')
=> ["11000010"]

> 'C'.unpack('B*')
=> ["01000011"]

Both b* and B* are looking at the same underlying data. It's just that they represent the data differently.
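To check that claim, here is a minimal sketch that derives the LSB-first string from the MSB-first one by reversing every group of eight bits. The helper name msb_to_lsb is my own and is not part of Ruby.

```ruby
# Hypothetical helper: convert an MSB-first bit string (as returned by
# unpack('B*')) into the LSB-first form by reversing every group of 8 bits.
def msb_to_lsb(bit_string)
  bit_string.scan(/.{8}/).map { |byte_bits| byte_bits.reverse }.join
end

msb = 'C'.unpack('B*').first   # "01000011"
lsb = 'C'.unpack('b*').first   # "11000010"

puts msb_to_lsb(msb) == lsb    # true
```

The reversal happens per byte, not over the whole string, so the same helper should also work for strings longer than one character.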

Different ways of getting the same data

Let's say that I want the binary value of the string hello. Based on the discussion in the last section that should be easy now.

> "hello".unpack('B*')
=> ["0110100001100101011011000110110001101111"]

The same information can also be derived as

> "hello".unpack('C*').map {|e| e.to_s 2}
=> ["1101000", "1100101", "1101100", "1101100", "1101111"]

Let's break down the previous statement into small steps.

> "hello".unpack('C*')
=> [104, 101, 108, 108, 111]

Directive C* gives the 8-bit unsigned integer value of each byte in the string. Note that the ASCII value of h is 104, the ASCII value of e is 101, and so on.

Using the technique discussed above I can find the hex value of the string.

> "hello".unpack('C*').map {|e| e.to_s 16}
=> ["68", "65", "6c", "6c", "6f"]

The hex value can also be obtained directly.

> "hello".unpack('H*')
=> ["68656c6c6f"]
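One detail worth noting: to_s drops leading zeros, which is why the per-byte binary strings earlier have seven digits while B* always gives eight per byte. The sketch below pads each byte so that both approaches produce identical output. The helper names bits_from_bytes and hex_from_bytes are hypothetical, made up for this illustration.

```ruby
# Hypothetical helpers: rebuild the output of unpack('B*') and unpack('H*')
# from the individual byte values.
def bits_from_bytes(str)
  # Integer#to_s(2) drops leading zeros, so pad each byte to 8 digits.
  str.unpack('C*').map { |byte| byte.to_s(2).rjust(8, '0') }.join
end

def hex_from_bytes(str)
  # Pad to 2 hex digits so bytes below 0x10 keep their leading zero.
  str.unpack('C*').map { |byte| byte.to_s(16).rjust(2, '0') }.join
end

puts bits_from_bytes("hello") == "hello".unpack('B*').first  # true
puts hex_from_bytes("hello")  == "hello".unpack('H*').first  # true
```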

High nibble first vs Low nibble first

Notice the difference in the below two cases.

> "hello".unpack('H*')
=> ["68656c6c6f"]

> "hello".unpack('h*')
=> ["8656c6c6f6"]

As per the Ruby documentation for unpack:

H | hex string (high nibble first)
h | hex string (low nibble first)

A byte consists of 8 bits. A nibble consists of 4 bits. So a byte has two nibbles. The ASCII value of 'h' is 104. The hex value of 104 is 68. This 68 is stored in two nibbles. The first (high) nibble contains the value 6 and the second (low) nibble contains the value 8. In general we deal with the high nibble first, so going from left to right we pick the value 6 and then 8.

However, if you are dealing with low nibble first, then the low nibble value 8 takes the first slot and 6 comes after it. Hence the result in "low nibble first" mode is 86.

This pattern is repeated for each byte. Because of that, a hex value of 68 65 6c 6c 6f looks like 86 56 c6 c6 f6 in low nibble first format.
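The per-byte swap can be sketched in a few lines. The helper swap_nibbles below is hypothetical; it only swaps the two hex digits of each byte and compares the result with what h* returns.

```ruby
# Hypothetical helper: swap the two hex digits of every byte, turning
# high-nibble-first output into low-nibble-first output.
def swap_nibbles(hex_string)
  hex_string.scan(/../).map { |pair| pair.reverse }.join
end

high = "hello".unpack('H*').first  # "68656c6c6f"
low  = "hello".unpack('h*').first  # "8656c6c6f6"

puts swap_nibbles(high) == low     # true
```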

Mix and match directives

In all the previous examples I used *. A * tells the directive to keep consuming data until the input runs out. Let's see a few examples.

A single C will get a single byte.

> "hello".unpack('C')
=> [104]

You can add more Cs if you like.

> "hello".unpack('CC')
=> [104, 101]

> "hello".unpack('CCC')
=> [104, 101, 108]

> "hello".unpack('CCCCC')
=> [104, 101, 108, 108, 111]

Rather than repeating the directive, I can follow it with a number that says how many times it should be applied.

> "hello".unpack('C5')
=> [104, 101, 108, 108, 111]

I can use * to capture all the remaining bytes.

> "hello".unpack('C*')
=> [104, 101, 108, 108, 111]

Below is an example where MSB and LSB are being mixed.

> "aa".unpack('b8B8')
=> ["10000110", "01100001"]
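Directives of different kinds can be mixed in the same way. As a small sketch, the hypothetical helper split_header below reads the first two bytes as integers and hands whatever is left to H*. The name and the "header" idea are only for illustration.

```ruby
# Hypothetical helper: read two bytes as 8-bit unsigned integers (C2),
# then return the rest of the string as one hex string (H*).
def split_header(str)
  str.unpack('C2H*')
end

p split_header("hello")  # [104, 101, "6c6c6f"]
```

Each directive picks up where the previous one stopped, so the order of directives matters.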

pack is reverse of unpack

Method pack does the opposite of unpack: it takes an array of values and packs them into a binary string according to the directives. Let's discuss a few examples.

> [0b1000001].pack('C')
=> "A"

In the above case the binary value 1000001 (decimal 65) is being interpreted as an 8-bit unsigned integer and the result is 'A'.

> ['A'].pack('H')
=> "\xA0"

In the above case the input 'A' is not ASCII 'A' but the hex digit 'A'. Why is it hex 'A'? Because the directive 'H' tells pack to treat the input as a hex string. Since 'H' is high nibble first and the input has only one nibble, the second (low) nibble is filled with zero. So the input is effectively treated as ['A0'].

Since the byte A0 lies outside the ASCII range (00 to 7F), there is no ASCII character to show for it, and the output is left as is. Hence the result is \xA0. The leading \x indicates that the value is written in hex.

Notice that in hex notation A is the same as a. So in the above example I can replace A with a and the result should not change. Let's try that.

> ['a'].pack('H')
=> "\xA0"

Let's discuss another example.

> ['a'].pack('h')
=> "\n"

In the above example notice the change. I changed the directive from H to h. Since h means low nibble first and the input has only one nibble, the input value is treated as the low nibble and the missing high nibble becomes zero. That means the resulting byte, written in the usual high nibble first notation, is 0A, and the output is \x0A. If you look at the ASCII table, hex 0A is decimal 10, which is NL (line feed, new line). Hence we see \n as the output.
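Because pack reverses unpack, feeding the output of one into the other with the same directive should give back the original string. Here is a minimal sketch of that idea. The helper round_trip is hypothetical, and the last two lines print the byte values of the single-nibble examples so the zero padding is visible.

```ruby
# Hypothetical helper: unpack a string with a directive and pack it back.
def round_trip(str, directive)
  str.unpack(directive).pack(directive)
end

puts round_trip("hello", 'C*') == "hello"  # true
puts round_trip("hello", 'H*') == "hello"  # true
puts round_trip("hello", 'b*') == "hello"  # true

# Single hex digit: the missing nibble is filled with zero.
p ['a'].pack('H').unpack('C*')  # [160], which is 0xA0
p ['a'].pack('h').unpack('C*')  # [10], which is 0x0A
```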

Usage of unpack in Rails source code

I did a quick grep of the Rails source code and found the following usages of unpack.

email_address_obfuscated.unpack('C*')
'mailto:'.unpack('C*')
email_address.unpack('C*')
char.unpack('H2')
column.class.string_to_binary(value).unpack("H*")
data.unpack("m")
s.unpack("U*")

We have already seen the directives C* and H for unpack. When used with unpack, the directive m decodes base64-encoded data, and the directive U* returns the Unicode code point of each UTF-8 character. Here is an example.

> "Hello".unpack('U*')
=> [72, 101, 108, 108, 111]
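For plain ASCII text U* and C* return the same numbers, so the difference only shows up with a multi-byte character. The sketch below uses the character é (written as the escape "\u00e9") as an assumed example and also round-trips a string through base64 with m. The helper name code_points is hypothetical.

```ruby
# Hypothetical helper: return the Unicode code points of a UTF-8 string.
def code_points(str)
  str.unpack('U*')
end

# "\u00e9" is the character é, which UTF-8 stores as two bytes.
p code_points("\u00e9")        # [233]
p "\u00e9".unpack('C*')        # [195, 169]

# pack('m') encodes to base64, unpack('m') decodes it again.
encoded = ["Hello"].pack('m')  # "SGVsbG8=\n"
p encoded.unpack('m')          # ["Hello"]
```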

Testing environment

The above code was tested with Ruby 1.9.2.

French version of this article is available here .

 
Neeraj Singh's profile picture
