13 String and Binary

13.1 Overview

A string is a sequence of character codes in UTF-8 format and is represented by the string class. The string class is a primitive type, which means that no operation can modify the content of a string instance. This leads to the following principles:

  • You cannot modify individual characters of a string through index access.
  • Modification methods return a new string instance that contains the modified result.

The interpreter itself provides fundamental operations for strings. Importing the re module extends these capabilities so that you can process string data using regular expressions.

Meanwhile, a binary is a byte sequence that can hold data in any format and is represented by the binary class. The binary class is an object type, so you can modify the content of an instance. A binary instance can be used as a plain memory image capable of containing any data.

13.2 Operation on String

13.2.1 Character Manipulation

To retrieve the character at a given position as a sub string, specify a zero-based index number enclosed in square brackets. You can also specify multiple index numbers to get a list of sub strings.

str = 'abcdefghijklmnopqrstuvwxyz'
str[6]            // returns 'g'
str[20]           // returns 'u'
str[17]           // returns 'r'
str[0]            // returns 'a'
str[6, 20, 17, 0] // returns ['g', 'u', 'r', 'a']

You can also specify iterators and lists to get a list of sub strings. Numbers and iterators can be mixed together as indexing items.

str = 'The quick brown fox jumps over the lazy dog'
str[10..14]       // returns ['b', 'r', 'o', 'w', 'n']
str[4..8, 35..38] // returns ['q', 'u', 'i', 'c', 'k', 'l', 'a', 'z', 'y']

If you specify an infinite iterator as an indexing item, you would get sub strings within an available range.

str = 'The quick brown fox jumps over the lazy dog'
str[35..]       // returns ['l', 'a', 'z', 'y', ' ', 'd', 'o', 'g']

A negative index counts positions from the end of the string, where -1 is the last position.

str = 'The quick brown fox jumps over the lazy dog'
str[-3]         // returns 'd'
str[-2]         // returns 'o'
str[-1]         // returns 'g'

Function chr() returns a string that contains a character of the given UTF-8 character code.

chr(65)         // returns 'A'

Function ord() takes a string and returns the UTF-8 character code of its first character.

ord('A')        // returns 65

13.2.2 Iteration

Method string#each() creates an iterator that returns each character as a sub string.

str = 'The quick brown fox jumps over the lazy dog'
x = str.each()
// x is an iterator that returns 'T', 'h', 'e' ...

A call of string#each() with attribute :utf8 or :utf32 would create an iterator that returns character code numbers in UTF-8 or UTF-32 instead of sub strings.

str = 'XXX'  // assumes it contains kanji characters 'ni-hon-go'
x = str.each():utf8
// x is an iterator that returns 0xe697a5, 0xe69cac and 0xe8aa9e

x = str.each():utf32
// x is an iterator that returns 0x65e5, 0x672c and 0x8a9e
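
The relation between the two kinds of character codes can be checked outside Gura. The following is a minimal Python sketch, not Gura code, that assumes the :utf8 code of a character is its UTF-8 byte sequence read as one big-endian integer and the :utf32 code is its Unicode code point. The variable names are chosen for illustration only.

```python
# Python sketch (not Gura): derive both kinds of character codes.
text = '日本語'  # the kanji characters 'ni-hon-go'

# UTF-8 code: the encoded bytes of one character read as a big-endian integer.
utf8_codes = [int.from_bytes(ch.encode('utf-8'), 'big') for ch in text]

# UTF-32 code: the Unicode code point of the character.
utf32_codes = [ord(ch) for ch in text]

print([hex(c) for c in utf8_codes])   # ['0xe697a5', '0xe69cac', '0xe8aa9e']
print([hex(c) for c in utf32_codes])  # ['0x65e5', '0x672c', '0x8a9e']
```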

Method string#eachline() creates an iterator that splits a string by a newline character and returns strings of each line.

str = R'''
1st
2nd
3rd
'''
lines = str.eachline()
// lines is an iterator that returns '1st\n', '2nd\n' and '3rd\n'

Method string#chop() is useful when you want to remove a newline character at the end of a string.

x = str.eachline()
lines = x:*chop()  // an iterator to apply string#chop() to each value in x
// lines is an iterator that returns '1st', '2nd' and '3rd'

Method string#eachline() and other methods that split multi-line text into lines, such as readlines(), accept an attribute :chop that applies the same process as string#chop().

lines = str.eachline():chop
// lines is an iterator that returns '1st', '2nd' and '3rd'

Method string#split() creates an iterator that splits a string by a separator string specified in the argument.

str = 'The quick brown fox jumps over the lazy dog'
x = str.split(' ')
// x is an iterator that returns 'The', 'quick', 'brown', 'fox' ...

If you want to split a string into segments of equal length, use the string#fold() method.

str = 'abcdefghijklmnopqrstuvwxyz'
x = str.fold(5)
// x is an iterator that returns 'abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy' and 'z'
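
The folding rule itself is simple: cut the string every n characters and let the last segment be shorter when the length is not a multiple of n. A rough Python sketch of that rule follows. It is not Gura code, and the helper name fold is only borrowed for illustration.

```python
# Python sketch (not Gura) of fixed-width folding.
def fold(text, width):
    """Yield consecutive segments of at most `width` characters."""
    for start in range(0, len(text), width):
        yield text[start:start + width]

segments = list(fold('abcdefghijklmnopqrstuvwxyz', 5))
print(segments)  # ['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
```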

13.2.3 Modification and Conversion

Applying an operator + between two string instances would concatenate them together.

str1 = 'abcd'
str2 = 'efgh'
str1 + str2   // returns 'abcdefgh'

An operator * between a string and a number value would concatenate the string the specified number of times.

str = 'abcd'
str * 3      // returns 'abcdabcdabcd'

Method list#join() joins all the strings in the list and returns the result. Elements other than strings are converted to strings before being joined.

['abcd', 'efgh', 'ijkl'].join()    // returns 'abcdefghijkl'

The method can take a separator string as its argument that is inserted between elements.

['abcd', 'efgh', 'ijkl'].join(', ') // returns 'abcd, efgh, ijkl'

Method string#capitalize() returns a string with the first letter converted to upper case.

str = 'hello, WORLD'
str.capitalize()  // returns 'Hello, WORLD'

Methods string#upper() and string#lower() return a string with all alphabetic characters converted to upper and lower case respectively.

str = 'hello, WORLD'
str.upper()       // returns 'HELLO, WORLD'
str.lower()       // returns 'hello, world'

Method string#binary() returns a binary instance that contains a binary sequence of the string in UTF-8 format.

str = 'XXX'    // assumes it contains kanji characters 'ni-hon-go'
str.binary()   // returns a binary b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

You can use string#encode() to get a binary sequence in a codec other than UTF-8.

str = 'XXX'              // assumes it contains kanji characters 'ni-hon-go'
str.encode('shift_jis')  // returns b'\x93\xfa\x96\x7b\x8c\xea'
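
If you want to confirm these byte sequences independently, the same text can be encoded with another tool. The sketch below uses Python's built-in codecs, not Gura, and assumes the codec name shift_jis refers to the same encoding in both environments.

```python
# Python sketch (not Gura): the same text in two encodings.
text = '日本語'  # the kanji characters 'ni-hon-go'

utf8_bytes = text.encode('utf-8')      # three bytes per character
sjis_bytes = text.encode('shift_jis')  # two bytes per character

print(utf8_bytes.hex(' '))  # e6 97 a5 e6 9c ac e8 aa 9e
print(sjis_bytes.hex(' '))  # 93 fa 96 7b 8c ea
```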

Method string#reader() returns a stream instance that reads a binary sequence of the string in UTF-8 format.

str = 'The quick brown fox jumps over the lazy dog'
x = str.reader()
// x is a stream instance for reading

Method string#encodeuri() converts characters that cannot appear in a URI into percent-encoded form, while method string#decodeuri() converts such an encoded string back into normal characters.

Method string#escapehtml() replaces characters that cannot appear literally in HTML with character entities prefixed by an ampersand, while method string#unescapehtml() converts such entities back into normal characters.
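
This chapter gives no sample for these four methods, so the sketch below only illustrates the two encoding schemes themselves, using Python's standard library rather than Gura. The exact set of characters that Gura escapes may differ from what Python's quote() and html.escape() choose, so treat the output as an illustration of the idea, not as Gura's result.

```python
# Python sketch (not Gura): percent-encoding and HTML entity escaping.
import html
from urllib.parse import quote, unquote

# Percent-encoding: each unsafe byte becomes '%' followed by two hex digits.
encoded = quote('a b&c', safe='')
decoded = unquote(encoded)
print(encoded)  # a%20b%26c
print(decoded)  # a b&c

# HTML escaping: markup characters become entities that start with '&'.
escaped = html.escape('<b>Tom & Jerry</b>')
unescaped = html.unescape(escaped)
print(escaped)    # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
print(unescaped)  # <b>Tom & Jerry</b>
```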

13.2.4 Extraction

Method string#strip() removes space characters that exist on both sides of the string. Attributes :left and :right would specify the side to remove spaces.

str = '    hello  '
str.strip()        // returns 'hello'
str.strip():left   // returns 'hello  '
str.strip():right  // returns '    hello'

Method string#left() returns a sub string consisting of the specified number of characters from the left side, while method string#right() extracts from the right side.

str = 'The quick brown fox jumps over the lazy dog'
str.left(3)  // returns 'The'
str.right(3) // returns 'dog'

Method string#mid() returns a sub string consisting of the specified number of characters starting at the specified position.

str = 'The quick brown fox jumps over the lazy dog'
str.mid(10, 5)  // returns 'brown'

13.2.5 Search, Replace and Inspection

To get the length of a string, use string#len(). Note that string#len() returns the number of characters, not the size in bytes.

str = 'abcdefghijklmnopqrstuvwxyz'
n = str.len()
// n is 26
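
The difference between the character count and the byte size only shows up with non-ASCII text. As a rough illustration in Python, not Gura, the three kanji characters used earlier in this chapter count as three characters but occupy nine bytes in UTF-8:

```python
# Python sketch (not Gura): character count versus UTF-8 byte size.
text = '日本語'  # the kanji characters 'ni-hon-go'

char_count = len(text)                  # number of characters
byte_count = len(text.encode('utf-8'))  # number of bytes in UTF-8

print(char_count)  # 3
print(byte_count)  # 9
```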

Method string#find() searches the target string for the specified sub string and returns the zero-based position where it is found. If it is not found, the method returns nil.

str = 'The quick brown fox jumps over the lazy dog'
str.find('fox')  // returns 16
str.find('cat')  // returns nil

Method string#replace() replaces the sub string with the specified one.

str = 'The quick brown fox jumps over the lazy dog'
str.replace('fox', 'cat') // returns 'The quick brown cat jumps over the lazy dog'

Method string#startswith() returns true if the string starts with the specified sub string, and returns false otherwise. Method string#endswith() checks if the string ends with the specified sub string.

str = 'abcdefghijklmnopqrstuvwxyz'
str.startswith('abcde') // returns true
str.startswith('hoge')  // returns false
str.endswith('vwxyz')   // returns true
str.endswith('hoge')    // returns false

With the attribute :rest, these methods return the string without the specified sub string when it matches the head or the tail. If the sub string doesn't match, they return nil.

str.startswith('abcde'):rest // returns 'fghijklmnopqrstuvwxyz'
str.startswith('hoge'):rest // returns nil
str.endswith('vwxyz'):rest  // returns 'abcdefghijklmnopqrstu'
str.endswith('hoge'):rest   // returns nil

13.3 Formatter

13.4 Functions Equipped with Formatter

Some functions accept format specifiers, similar to those of printf in the C language, to convert objects such as numbers into readable strings.

Function printf() takes a string containing format specifiers, followed by the values you want to print, in its argument list and writes the result to the sys.stdout stream.

printf('x = %d, y = %d\n', x, y)

Method stream#printf() has the same argument declaration as printf() but writes the result to the target stream, which must be writable, instead of the sys.stdout stream.

open('foo.txt', 'w').printf('x = %d, y = %d\n', x, y)

Method list#printf() is another form of printf(), which takes values to print in the list of the target instance, not in the argument list.

[x, y].printf('x = %d, y = %d\n')

Function format() takes arguments in the same way as printf() but it returns the result as a string instance.

str = format('x = %d, y = %d\n', x, y)

You can also use the % operator to get the same result as format(). It takes a format string on the left and a list containing the values to print on the right.

str = 'x = %d, y = %d\n' % [x, y]

If there's only one value to print, you can give it to the operator directly without wrapping it in a list.

str = 'x = %d\n' % x

13.5 Syntax of Format Specifier

A format specifier begins with a percent character and has the syntax below, where optional fields are enclosed in square brackets:

%[flags][width][.precision]specifier

You always have to specify one of the following characters for the specifier field.

specifier  Note
d, i       decimal integer number with a sign mark
u          decimal integer number without a sign mark
b          binary integer number without a sign mark
o          octal integer number without a sign mark
x          hexadecimal integer number in lower case without a sign mark
X          hexadecimal integer number in upper case without a sign mark
e          floating number in exponential form
E          floating number in exponential form (in upper case)
f          floating number in decimal form
g          better form between e and f
G          better form between E and f
s          string
c          character

You can specify one of the following characters for the optional flags field.

flags    Note
+        a plus sign precedes positive numbers
-        left-aligns the converted value within the field width
(space)  a space character precedes positive numbers
#        converted results of binary, octal and hexadecimal are preceded by '0b', '0' and '0x' respectively
0        fills unused columns with '0'

The optional field width takes a decimal number that specifies a minimum width for the corresponding value. If the value's length is shorter than the specified width, the rest would be filled with space characters. If you specify * for that field, the formatter would try to get the minimum width from the argument list.

The optional field precision has different meanings depending on the specifier as below:

specifier            Note
d, i, u, b, o, x, X  It specifies the minimum number of digits. If the value is shorter than this number, lacking digits are filled with zero.
e, E, f              It specifies the number of digits after a decimal point.
g, G                 It specifies the maximum number of digits for significand part.
s                    It specifies the maximum number of characters to print.
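
The fields combine in the same way as in C's printf. The sketch below uses Python's % operator, which follows the same printf conventions for the cases shown, to illustrate flags, width and precision. It is not Gura code, and the correspondence is an assumption: Python has no b specifier, and its # flag prefixes octal numbers with '0o' rather than '0', so those cases are left out.

```python
# Python sketch (not Gura): printf-style flags, width and precision.
results = {
    'plus flag':       '%+d' % 42,           # '+42'
    'space flag':      '% d' % 42,           # ' 42'
    'left-aligned':    '%-6d|' % 42,         # '42    |'
    'zero fill':       '%06d' % 42,          # '000042'
    'hex prefix':      '%#x' % 255,          # '0xff'
    'width from args': '%*d' % (6, 42),      # '    42'
    'min digits':      '%.4d' % 42,          # '0042'
    'decimals':        '%.2f' % 3.14159,     # '3.14'
    'significand':     '%.3g' % 1234.5678,   # '1.23e+03'
    'max characters':  '%.3s' % 'abcdef',    # 'abc'
}

for name, text in results.items():
    print('%-16s %r' % (name, text))
```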

13.6 Regular Expression

You can import the re module to use regular expressions for string search and substitution. It supports a syntax based on POSIX Extended Regular Expression.

Importing module re would equip string class with methods that can handle regular expressions. See the sample code below:

import(re)

str = '12:34:56'

m = str.match(r'(\d\d):(\d\d):(\d\d)')
if (m) {
    printf('hour=%s, min=%s, sec=%s\n', m[1], m[2], m[3])
} else {
    println('not match')
}

Method string#match(), which is provided by the re module, takes a regular expression pattern. It returns a re.match instance if the pattern matches, and nil otherwise. As regular expressions often contain backslashes as meta characters, it is convenient to write the pattern string in the form r'' so that a backslash is not recognized as an escape character.

An instance of re.match contains information about the matching result. It supports index access, where m[0] is the string that matches the whole pattern and m[1], m[2] … are the strings of each group. You can index a group by a string instead of a number when you use the ?<name> specifier for the group in the regular expression pattern.

m = str.match(r'(?<hour>\d\d):(?<min>\d\d):(?<sec>\d\d)')
if (m) {
    printf('hour=%s, min=%s, sec=%s\n', m['hour'], m['min'], m['sec'])
} else {
    println('not match')
}

Although you can pass a string containing a pattern to method string#match(), the method actually takes a re.pattern instance as its argument, and a string instance is accepted through the casting feature. The example above is equivalent to the one below:

pat = re.pattern(r'(\d\d):(\d\d):(\d\d)')
m = str.match(pat)

When you give a string to a function or method that expects re.pattern, the string is compiled into a re.pattern instance on every call, which may cause noticeable overhead when processing a large amount of data. In such a case, it may be a good idea to create a re.pattern instance explicitly with the re.pattern() function in advance and pass it to the call, as shown above.
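
The compile-once idea is common to most regular expression libraries. As an illustration in Python, not Gura, the sketch below compiles a pattern a single time and reuses it for every line. Note that Python writes named groups as (?P<name>...) instead of (?<name>...), and that the helper parse_times and its sample input are hypothetical.

```python
# Python sketch (not Gura): compile a pattern once and reuse it in a loop.
import re

# Compiled a single time, outside the loop.
TIME_PATTERN = re.compile(r'(?P<hour>\d\d):(?P<min>\d\d):(?P<sec>\d\d)')

def parse_times(lines):
    """Return (hour, min, sec) string tuples for the lines that match."""
    results = []
    for line in lines:
        m = TIME_PATTERN.match(line)  # no recompilation per line
        if m:
            results.append((m['hour'], m['min'], m['sec']))
    return results

parsed = parse_times(['12:34:56', 'not a time', '01:02:03'])
print(parsed)  # [('12', '34', '56'), ('01', '02', '03')]
```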

Method string#sub() takes a re.pattern instance and replaces the matched part with the given substitution value.

A substitution item can be either a string or a function. When you give a string for it, the method replaces the matched part with the string.

str = 'The quick brown fox jumps over the lazy dog'
str.sub(r'[Tt]he', 'THE') // returns 'THE quick brown fox jumps over THE lazy dog'

You can specify a group reference \n in a substitution string, where n indicates the group index.
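
No sample of a group reference is given here, so the following Python sketch, not Gura code, shows the general idea with Python's re.sub(), which happens to use the same \n notation. Each \n in the substitution string is replaced by the text captured by group n.

```python
# Python sketch (not Gura): group references in a substitution string.
import re

source = '12:34:56'
# \1, \2 and \3 stand for the text captured by the first, second and third group.
rewritten = re.sub(r'(\d\d):(\d\d):(\d\d)', r'\1h \2m \3s', source)
print(rewritten)  # 12h 34m 56s
```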

If you specify a function as the substitution value, it is called each time the pattern matches. The function takes a re.match value as its argument and returns a substitution string.

str = '### #### ##### ## #####'
f(m:re.match) = format('%d', m[0].len())
str.sub('#+', f)                             // returns '3 4 5 2 5'

An anonymous function makes the code simpler.

str = '### #### ##### ## #####'
str.sub('#+', &{format('%d', $m[0].len())})  // returns '3 4 5 2 5'

13.7 Operation on Binary

13.7.1 Creation of Instance

You can create a binary instance by prefixing a string literal with b.

b'AB\x01\x00\xff'

The example above is a binary instance that contains a sequence of byte data: 0x41, 0x42, 0x01, 0x00 and 0xff. As an instance created by a string literal prefixed with b cannot be modified, an error occurs when you try a modification operation on such an instance.

There are several ways to create a binary instance that accepts modification.

  • Constructor function binary() creates an empty binary instance.

      buff = binary()
    
  • Class method binary.alloc() creates a binary instance of the specified size.

      buff = binary.alloc(1000)
      // buff has a memory of 1000 bytes
    
  • Class method binary.pack() packs values into a binary sequence according to the packing specifier.

      buff = binary.pack('Bl', 0xaa, 0x12345678)
      // buff has a byte sequence: 0xaa, 0x78, 0x56, 0x34, 0x12.
    

You can use method binary#dump() to print the content of a binary in hexadecimal dump format.

13.7.2 Byte Manipulation

An index access on a binary would return a number value at the specified position.

buff = b'\xaa\xbb\xcc\xdd\xee'
buff[0] // returns 0xaa
buff[1] // returns 0xbb

You can also specify an iterator as an indexing item for a binary just like a string.

buff[1..3] // returns [0xbb, 0xcc, 0xdd]

Modification through an indexer on a writable binary is also possible.

buff = binary.alloc(8)
buff[0] = 0x12
buff[1] = 0x34
buff[3..] = 0..4
// buff has a byte sequence: 0x12, 0x34, 0x00, 0x00, 0x01, 0x02, 0x03, 0x04.
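
The split between read-only literals and writable buffers has a close analogue in Python, where bytes is immutable and bytearray is mutable. The sketch below is Python, not Gura, and is only meant to illustrate the distinction. It repeats the assignments above on a zero-filled 8-byte buffer.

```python
# Python sketch (not Gura): read-only versus writable byte sequences.
frozen = b'\xaa\xbb\xcc\xdd\xee'  # immutable, like a b'' literal
try:
    frozen[0] = 0x12
    literal_is_writable = True
except TypeError:
    literal_is_writable = False   # item assignment is rejected

buff = bytearray(8)               # writable, zero-filled buffer of 8 bytes
buff[0] = 0x12
buff[1] = 0x34
buff[3:] = bytes(range(5))        # fills positions 3..7 with 0, 1, 2, 3, 4

print(literal_is_writable)  # False
print(buff.hex(' '))        # 12 34 00 00 01 02 03 04
```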

Method binary#each() creates an iterator that returns each 8-bit number value in the binary.

buff = b'\xaa\xbb\xcc\xdd\xee'
x = buff.each()
// x is an iterator that returns 0xaa, 0xbb, 0xcc, 0xdd and 0xee.

13.7.3 Pack and Unpack

Using an indexer and the binary#each() method, you can retrieve and modify the content of a binary in units of 8-bit numbers. To store and extract values such as numbers that consist of multiple octets and strings that contain a sequence of character codes, the following methods are provided.

  • Class method binary.pack() to create a binary sequence that contains numbers and strings.
  • Method binary#unpack() to extract numbers and strings from a binary sequence.

Class method binary.pack() takes a format string containing specifiers, followed by the values to store, as its arguments. For example:

rtn = binary.pack('H', 0x1234)

The specifier H means an unsigned 16-bit number, so the result rtn is a binary instance that contains a binary sequence of 0x34 and 0x12.

You can write any number of specifiers in the format.

rtn = binary.pack('HHH', 0x1234, 0xaabb, 0x5678)

The result contains a binary sequence of 0x34, 0x12, 0xbb, 0xaa, 0x78 and 0x56.

If the same specifier appears several times in a row as above, you can group them by writing the count before the specifier.

rtn = binary.pack('3H', 0x1234, 0xaabb, 0x5678)

This has the same result as the previous example.

Meanwhile, method binary#unpack() takes a format string and returns a list containing the unpacked result. For example:

buff = b'\x34\x12'
rtn = buff.unpack('H')

The result rtn is a list [0x1234]. Note that you always get a list as the result even when it contains only one value.

Below is an example of a format that contains multiple specifiers:

buff = b'\x34\x12\xbb\xaa\x78\x56'
rtn = buff.unpack('HHH')
// rtn is [0x1234, 0xaabb, 0x5678]

As with packing, you can write a count before a specifier to repeat it.

buff = b'\x34\x12\xbb\xaa\x78\x56'
rtn = buff.unpack('3H')

Assigning the result to a lister expression is often helpful, since it lets you assign the extracted values to independent variables.

buff = b'\x34\x12\xbb\xaa\x78\x56'
[x, y, z] = buff.unpack('3H')
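
The specifier letters closely resemble those of Python's struct module, so the round trip can be cross-checked there. The sketch below is Python, not Gura, and assumes that the shared letters mean the same thing in both. The '<' prefix is needed on the Python side because struct defaults to native byte order and alignment, whereas this chapter describes little endian as the default.

```python
# Python sketch (not Gura): pack three 16-bit values and unpack them again.
import struct

packed = struct.pack('<3H', 0x1234, 0xaabb, 0x5678)
print(packed.hex(' '))  # 34 12 bb aa 78 56

# struct.unpack always returns a tuple, even for a single value.
x, y, z = struct.unpack('<3H', packed)
print(hex(x), hex(y), hex(z))  # 0x1234 0xaabb 0x5678
```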

The table below summarizes specifiers that are used to pack or unpack number values.

Specifier  Unit Size  Note
b          1 byte     Packs or unpacks a signed 8-bit number (-128 to 127).
B          1 byte     Packs or unpacks an unsigned 8-bit number (0 to 255).
h          2 bytes    Packs or unpacks a signed 16-bit number (-32768 to 32767).
H          2 bytes    Packs or unpacks an unsigned 16-bit number (0 to 65535).
i          4 bytes    Packs or unpacks a signed 32-bit number (-2147483648 to 2147483647).
I          4 bytes    Packs or unpacks an unsigned 32-bit number (0 to 4294967295).
l          4 bytes    Packs or unpacks a signed 32-bit number (-2147483648 to 2147483647).
L          4 bytes    Packs or unpacks an unsigned 32-bit number (0 to 4294967295).
q          8 bytes    Packs or unpacks a signed 64-bit number (-9223372036854775808 to 9223372036854775807).
Q          8 bytes    Packs or unpacks an unsigned 64-bit number (0 to 18446744073709551615).
f          4 bytes    Packs or unpacks a single precision floating point number.
d          8 bytes    Packs or unpacks a double precision floating point number.
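
The integer ranges follow from two's-complement arithmetic: a signed n-bit number spans -2^(n-1) to 2^(n-1)-1 and an unsigned one spans 0 to 2^n-1. The Python sketch below, not Gura code, computes these bounds and shows that Python's struct module rejects a value one past the signed 32-bit maximum. The helper names are hypothetical, and how Gura reacts to an out-of-range value is not covered by this sketch.

```python
# Python sketch (not Gura): value ranges of n-bit integers.
import struct

def signed_range(bits):
    """Return (min, max) of a two's-complement signed integer."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def unsigned_range(bits):
    """Return (min, max) of an unsigned integer."""
    return 0, (1 << bits) - 1

for bits in (8, 16, 32, 64):
    print(bits, signed_range(bits), unsigned_range(bits))

# One past the signed 32-bit maximum does not fit in specifier 'i'.
try:
    struct.pack('<i', 2147483648)
    overflow_rejected = False
except struct.error:
    overflow_rejected = True
print(overflow_rejected)  # True
```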

By default, numbers of 16-bit, 32-bit and 64-bit size are stored in little-endian byte order. You can change the byte order by using the following specifiers:

Specifier  Note
@          Switches to the system-dependent byte order.
=          Switches to the system-dependent byte order.
<          Switches to little endian.
>          Switches to big endian.
!          Switches to big endian.

rtn = binary.pack('H>H', 0x1234, 0x1234)
// rtn contains 0x34, 0x12, 0x12, 0x34.

Specifier x only advances the pointer by the specified size without packing or unpacking any value. When packing, the skipped area is filled with zero.

rtn = binary.pack('H3xH', 0x1234, 0x1234)
// rtn contains 0x34, 0x12, 0x00, 0x00, 0x00, 0x34, 0x12.
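
Both results can be reproduced with Python's struct module as a cross-check. This is a Python sketch, not Gura code. One difference is worth noting: Python only accepts a byte-order character at the start of a format string, so the mixed-order case is built from two separate calls instead of a single 'H>H' format.

```python
# Python sketch (not Gura): byte order and padding.
import struct

# Little-endian value followed by the same value in big-endian order.
mixed = struct.pack('<H', 0x1234) + struct.pack('>H', 0x1234)
print(mixed.hex(' '))   # 34 12 12 34

# '3x' inserts three zero bytes between the two values.
padded = struct.pack('<H3xH', 0x1234, 0x1234)
print(padded.hex(' '))  # 34 12 00 00 00 34 12
```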

Specifiers c and s are provided to pack or unpack string data.

Specifier  Note
c          Packs the first character code in a string, or unpacks an 8-bit number as a character code and returns a string containing it.
s          Packs the character codes in a string according to the specified codec, or unpacks 8-bit numbers as character codes according to the specified codec and returns a string containing them.

You can specify a codec for s specifier by surrounding its name with { and }.

13.7.4 Pointer

binary#pointer()

pointer#unpack()

pointer#pack()

13.7.5 Binary as Stream

binary#writer()

binary#reader()

cast from binary to stream