Perl Notes

This is a summary of the presentation I gave at the Warp UK user group meeting on July 5th 2003 and at Warpstock Europe in November 2004 and in November 2009.

You can download perl 5.8.0 from Hobbes.

Installation

Installation is just a case of following the readme ;-) - But some points to note:

You need the EMX DLLs - does anyone not have them these days? :-)
If you are going to install perl modules from CPAN then you will need gcc & GNU make.

There are a couple of config.sys statements you may need. Depending on how perl was built, it might have compiled into its internal scripts the drive letter of the drive it was installed on as well as the path, such as e:/usr/lib/perl/lib. If the builder is real smart he will build it as /usr/lib/perl/lib. In the second case you can move the whole directory tree to whatever drive you like and it will work. In the first case you need the following environment variable, which changes, on the fly, all occurrences of where it was built to where it is now installed.


PERLLIB_PREFIX=e:/usr/lib/perl/lib;X:\usr\lib\perl\lib

Where X: is, as usual, the drive you have /usr/lib/perl/lib installed on.

Perl will also be looking for a *nix type shell to run things such as system calls in, so the second config.sys statement you may need is


PERL_SH_DIR=X:\BIN

Which tells perl where to find a shell executable.

Introduction

Perl is yet another scripting language like REXX, python etc. Like REXX it was invented by one man, Larry Wall, and his book "Programming Perl" is a very good introduction. Known by all perl addicts as "The camel book", it is published by O'Reilly who, incidentally, put a lot of support into perl. They also publish some very handy "Pocket References" on various subjects including perl and HTML. Not all bookstores stock them though.

Officially perl is the "Practical Extraction and Report Language", unofficially it has been referred to as the "Pathologically Eclectic Rubbish Lister". The perl motto is TMTOWTDI - There's More Than One Way To Do It - as, given more than one perl programmer and a task, they will all come up with different code. All of which works and is "correct".

Those of you who know REXX will know that the OS/2 command interpreter will run as a batch file any file ending with .cmd - If that file starts with a /* */ type comment line then it will pass the rest of the script to the REXX interpreter. However, it does not know about perl. There are two ways around this. You can have perl.exe in your path so that you can type perl <name of perl script> (Note perl scripts are normally suffixed .pl). The other is to suffix the script with .cmd and put extproc perl as the first line of the script. With this method the OS/2 command processor loads perl and passes it the rest of the script. The main disadvantages of the extproc approach are that the script will no longer run on a *nix box and that intelligent editors will not know they are editing perl, which they do from the .pl suffix.

Because I write and run scripts on both OS/2 and unix, and like my editor to know I am working with perl and not REXX, which most default to if they see .cmd, all the following examples are *nix style. They will work in OS/2 if you cut and paste them and invoke with:

perl <name of perl script>.

So helloworld.cmd looks like:


extproc perl
print "Hello World\n";

and is run by just typing helloworld.

helloworld.pl on the other hand looks like:


print "Hello World\n";

and is run by typing perl helloworld.pl

Both produce the same result - the string "Hello World" followed by a line feed. (That's the \n for you non C types.)

Before we go any further, I must mention the shebang line. Shebang is hallowed unix speak for the first line of a script that starts #!. What follows is the program to execute the rest of the script. So perl scripts on unix tend to start


#!/usr/local/bin/perl -w

Or whatever the path to the perl executable is. The -w above is a switch to tell perl to turn on warning messages. Putting it in the script makes sure we don't forget any switches. This works on OS/2 in so far as the switches are obeyed. Like all good unix programmes perl has a lot of command line switches. The only one you may need on the command line itself is -T which turns on "taint mode" - more of that later. Note that you can have a shebang line to set switches and use extproc to run the script. They are not mutually exclusive.

Basics

Now you know how to run perl scripts let's take a look at the syntax. Perl statements are terminated by a semicolon (;) and because of this they can cross physical lines. Unless inside quoted strings, whitespace is generally not significant. Anything after a hash (#) sign is taken as a comment to the end of the current input line. How long should a line be? Keep it readable. Traditionalists will use 72 character lines harking back to 80 column punch cards where cc73-80 were used to sequence number the deck - they often used to get dropped!

So you can't have a block of comments REXX style as:


/* some comments
   and more comments
   ending here
*/

You have to code them like this:


# some comments
# and more comments
# ending here
#

Unlike REXX, perl is a "Data Typed" language. This means you have to tell perl what type each variable is, which you do with the first character of its name. For this article we will only consider scalar, array and hash types. Scalars start with a $ sign, arrays with an @ sign and hashes with a % sign. A scalar can hold any type of data - string, number etc. This is also true of array elements. You can have an array where element 0 (perl starts indexing from 0 BTW) is a number, element 1 is a string and element 2 a reference to something else. We are not going into references in this article, for now just take it as read that an array element can contain just about anything - including, by way of a reference, another array. As indeed can scalars.
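To make the three types concrete, here is a minimal sketch. The variable names and values are made up for illustration, and the "my" and use strict it relies on are explained further on.

```perl
# types.pl - a sketch of the three variable types (all names are made up)

use strict;
use warnings;

my $scalar = 'a string';            # scalars start with $
my @array  = (42, 'thing', 3.14);   # arrays start with @
my %hash   = (colour => 'red');     # hashes start with %

# a single element is itself a scalar, so it is written with a $
print "scalar: $scalar\n";
print "array element 1: $array[1]\n";
print "hash element: $hash{colour}\n";

# an array element can hold a reference to another array
$array[3] = [ 'x', 'y' ];
print "nested: $array[3][1]\n";
```

Note that the sigil follows what you are getting back, not what you are getting it from: $array[1] is one scalar taken out of @array.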

Let's look at some code:


#!/usr/local/bin/perl -w

# scalar1.pl

$thing = 0;
# single quotes stop substitution

print 'thing = $thing\n';
print "\n";

# double quotes allow substitution

print "thing = $thing\n";
{
  $somethingelse = 1;
  $thing++; # this is a quick way of incrementing by one.
  print "thing = $thing somethingelse = $somethingelse\n";
}

print "thing = $thing somethingelse = $somethingelse\n";

Running this as perl scalar1.pl gives:


thing = $thing\n
thing = 0
thing = 1 somethingelse = 1
thing = 1 somethingelse = 1

Two points here. First, putting something in single quotes stops substitution, unlike REXX where what you quote with does not matter as long as they match. If you need to put the quote character inside the quoted string, escape it with a backslash (\): "She said \"Oh dear\"." The second point is the wiggly braces. These denote a "code block" and are usually found after ifs, whiles etc. The point here is that a variable inside the {} is not always the same as the same named variable outside the {}. Technically this is known as the "scope" of the variable. Let's hack the code around and see what happens.


#!/usr/local/bin/perl -w

# scalar2.pl

$thing = 0;

print "\n";
print "thing = $thing\n";
$somethingelse = 0;
{
  $somethingelse = 1;
  $thing++; # this is a quick way of incrementing by one.
  print "thing = $thing somethingelse = $somethingelse\n";
}

print "thing = $thing somethingelse = $somethingelse\n";

Running this as perl scalar2.pl gives:


thing = 0
thing = 1 somethingelse = 1
thing = 1 somethingelse = 1

The variable $somethingelse is set before the code block so it exists inside the block and afterwards. Now let's make it local inside the block. We do that by prefixing it with "my".


#!/usr/local/bin/perl

# scalar3.pl

#use strict;
#use warnings;

$thing = 0;

print "\n";
print "thing = $thing\n";
$somethingelse = 0;

#  lots and lots of code

$somethingelse = 1;

# this code block could be an if or a loop
{
  my $somethingelse = 2;
  $thing++; # this is a quick way of incrementing by one.
  print "thing = $thing somethingelse = $somethingelse\n";
}

print "thing = $thing somethingelse = $somethingelse\n";

Running this as perl scalar3.pl gives:


thing = 0
thing = 1 somethingelse = 2
thing = 1 somethingelse = 1

Now see what happens? There are now two variables called $somethingelse, one inside the {} and one outside. Obviously here is scope for great confusion, so there is a bit of magic perl can help us with: use strict; If we put that at the start of our script then every variable must be declared with "my" before it is used, and perl will stop with an error message if it finds one that has not been. That catches mistyped names and makes you say exactly where each variable lives. I advise you to always use it.

Cut and paste the above into an editor and then try it. Uncomment the use statements, then try removing the lines that set $somethingelse outside the code block, or removing the "my" inside the code block. Try combinations of these.
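If you would rather see the complaint without breaking your own script, here is a rough sketch. The mistyped name $thnig is deliberate, and wrapping the offending line in a string eval is just my way of letting the script carry on and print the error - normally perl would stop before running anything.

```perl
# strict.pl - a sketch of what "use strict" catches

use strict;
use warnings;

my $thing = 0;

# the snippet is compiled by eval so that this script can carry on
# and report the error instead of dying at compile time
my $result = eval q{
  use strict;
  $thnig = $thing + 1;   # typo: $thnig was never declared with "my"
  1;
};
my $error = $@;          # eval leaves the error message in $@

print "strict said: $error" if ! defined $result;
```

The message should name the offending variable and the line inside the eval, which is usually all you need to find the typo.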

Equal and Equal compared

Perl uses different operators for testing numbers and strings.
It is easy to remember which as numbers use symbols and strings use letters.

Operation	On Numbers	On Strings
Less than	<	lt
Less than or equal	<=	le
Equal	==	eq
Not equal	!=	ne
Greater than or equal	>=	ge
Greater than	>	gt
Compare *	<=>	cmp

* <=> is known as the "spaceship operator", both types of compare return -1, 0, +1, for less than, equal to & greater than. Note that you might get away with using the wrong type of operator without getting an error, but the result will certainly not be what you want or expect.
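Here is a short sketch, with made-up values, of how the two families of operator can disagree about the same pair of values.

```perl
# compare.pl - numeric versus string comparison of the same values

use strict;
use warnings;

my ($x, $y) = (10, 9);

my $numeric_less = $x <  $y ? 'yes' : 'no';   # 10 is not less than 9
my $string_less  = $x lt $y ? 'yes' : 'no';   # but "10" sorts before "9"

my $spaceship = $x <=> $y;   # numeric compare
my $cmp       = $x cmp $y;   # string compare

print "10 <  9  : $numeric_less\n";
print "10 lt 9  : $string_less\n";
print "10 <=> 9 : $spaceship\n";
print "10 cmp 9 : $cmp\n";
```

As strings, "10" comes before "9" because the comparison goes character by character and "1" sorts before "9". No error, no warning, just the wrong answer.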

Arrays

Now let's have a look at arrays. Array names start with an @ sign. They are 0 indexed and individual elements are referenced by $arrayname[element #]. The index of the last element is $#arrayname. Arrays are really lists and operate in "list context", however if they are referenced in scalar context they return the size of the array:


#!/usr/local/bin/perl

# array.pl

use strict;
use warnings;

my @array;

$array[0] = 1;
$array[1] = 'thing';
$array[2] = 3;

print "@array \n";
print $#array."\n";
my $i = @array;
print "$i \n";

Results in


1 thing 3
2
3
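Context trips a lot of people up, so here is another small sketch of the same array looked at in both ways. The variable names are mine.

```perl
# context.pl - the same array in scalar and list context

use strict;
use warnings;

my @array = (1, 'thing', 3);

my $count   = @array;      # scalar context: the number of elements
my $last    = $#array;     # the index of the last element
my ($first) = @array;      # list context: the first element is copied
my $tail    = $array[-1];  # negative indexes count back from the end

print "count = $count, last index = $last\n";
print "first = $first, tail = $tail\n";
```

The brackets round $first are what make the difference: without them the assignment is in scalar context and you get the count instead.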

Hashes

Hashes on the other hand are much more fun. They start with a % sign and individual elements are referenced by $hashname{element_name}. The element name is more correctly known as the "key".

Note that:

The key can be a number or a string.
The order of the elements in the hash is unpredictable.

They can be sorted however, either by key or by value. The following example shows this.


#!/usr/local/bin/perl -w

# hash4.pl

use strict;

my %hash;
my $thing;
my $key;

# define a hash with keys in jumbled order
#
$hash{'a string'} = 'zzzzzzzz';
$hash{'b string'} = 'string';
$hash{'c string'} = 'string';
$hash{'12345678'} = 'another string';

# the hash is stored as a list
# of key/value pairs
# this iterates over the list

foreach $thing (%hash)
{
  print "$thing\n";
}

# the above printed keys and values on separate lines.
# to get key/value pairs we must tell it to only process the keys thus

print "\n\n";

foreach $key (keys %hash)
{
  print "Key: \"$key\" Value: \"$hash{$key}\"\n";
}

print "\n\n";

# they still print in a random order though so we sort the keys

foreach $key (sort keys %hash)
{
  print "Key: \"$key\" Value: \"$hash{$key}\"\n";
}
# and here we sort by value. hash_by_value is a subroutine invoked by
# the sort process - see below for more details
# 

print "\n\n";

foreach $key (sort hash_by_value keys %hash)
{
  print "$key $hash{$key}\n";
}
exit;

sub hash_by_value
{
  # sort calls this routine with two values $a and $b
  # the code has to tell sort which order they should be in
  # it does this by returning:
  # -1 if $a is before $b
  # 0 if they are equal
  # +1 if $b is before $a
  # if the values we are sorting on are numeric we use the <=> operator
  # if the values are strings we use the cmp operator
  # the || syntax runs if the left hand result is 0
  # so here we are sorting on the string value of the hash and if the
  # values are equal we sort on the key
  
  $hash{$a} cmp $hash{$b} || $a cmp $b;

  # we don't need to return anything as 
  # 1) perl supplies a return by default
  # 2) if a return value is not specified perl returns the
  # value of the last expression
}

Results in:


b string
string
12345678
another string
c string
string
a string
zzzzzzzz


Key: "b string" Value: "string"
Key: "12345678" Value: "another string"
Key: "c string" Value: "string"
Key: "a string" Value: "zzzzzzzz"


Key: "12345678" Value: "another string"
Key: "a string" Value: "zzzzzzzz"
Key: "b string" Value: "string"
Key: "c string" Value: "string"


12345678 another string
b string string
c string string
a string zzzzzzzz
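TMTOWTDI applies here too. As a sketch, the same sort by value can be written with the comparison in an inline block rather than a named subroutine. The hash is the one from hash4.pl and @by_value is a name I have made up.

```perl
# hash5.pl - sort by value using an inline block

use strict;
use warnings;

my %hash = (
  'a string' => 'zzzzzzzz',
  'b string' => 'string',
  'c string' => 'string',
  '12345678' => 'another string',
);

# same comparison as hash_by_value: by value first, then by key
my @by_value = sort { $hash{$a} cmp $hash{$b} || $a cmp $b } keys %hash;

foreach my $key (@by_value)
{
  print "$key $hash{$key}\n";
}
```

Which you use is a matter of taste. A named subroutine is handy if the same ordering is needed in several places.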

True and False

Perl has a slightly odd definition of true and false.

A number of value zero is false, all other numbers are true.
An empty string ("") or a string containing zero ("0") is false, all other strings are true.
Any undefined value is false.

An undefined value is a variable that has no value - perl does not default values when you declare the variable as some other languages do. REXX for example defaults the value of a variable to its name in upper case.

So given the following code snippet


	my $var1 = 3;
	my $var2;

Then $var1 has the value 3 and $var2 has no value and is undefined. Note that undefined is not zero - it is no value. Variables may be made undefined. Why would you want to? Well many modules return undefined if there is an error or no data to return. You can test for definition thus:


	if ( defined $var ) # test if $var has a value

Or conversely


	if ( ! defined $var ) # test if $var has no value

Undefined is similar in concept, and as confusing, as NULL in database speak.
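To see the rules in action, here is a sketch that runs a few values through a made-up helper called truth and then checks a variable for definition.

```perl
# truth.pl - what perl counts as true and false

use strict;
use warnings;

# truth is a hypothetical helper that reports how perl sees a value
sub truth
{
  my ($value) = @_;
  return $value ? 'true' : 'false';
}

my $var2;   # declared but never given a value

print 'number 0     : ', truth(0),     "\n";
print 'number 42    : ', truth(42),    "\n";
print 'empty string : ', truth(''),    "\n";
print 'string "0"   : ', truth('0'),   "\n";
print 'string "0.0" : ', truth('0.0'), "\n";
print 'undefined    : ', truth($var2), "\n";

$var2 = 0;
print "var2 is defined but false\n" if defined $var2 && ! $var2;

undef $var2;   # make it undefined again
print "var2 is undefined again\n" if ! defined $var2;
```

The one that catches people is the string "0.0". As a string it is neither empty nor exactly "0", so it is true, even though the number 0.0 is false.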

Regular Expressions

Perl implements powerful regular expressions and here are a couple of examples.

I had to parse some XML consisting of employee records.


<data>
	<EmployeeProfile>
		<name>Fred Flintstone</name>
		<town>Bedrock</town>
		<spouse>Wilma</spouse>
	</EmployeeProfile>
	<EmployeeProfile>
		<name>Barny Rubble</name>
		<town>Bedrock</town>
		<spouse>Betty</spouse>
	</EmployeeProfile>
</data>

The data was around 120,000 records with many more key/value pairs. However, there was no deeper nesting of keys and no attributes to worry about. Also, the API I was going to use to pass the data to another application that did not understand XML took a key/value hash as one of its parameters. Handy :-) Now the code:


use strict;
use warnings;

my %inputkeyvalue;

open EMPS, "<emp.xml" or die "Can't open input data $!\n";

$/ = "<\/EmployeeProfile>"; # change line ending

while ( <EMPS> )
{
  (%inputkeyvalue) = m/<(\w+)>(.*)<\/\1>/g;

  foreach my $key (sort keys %inputkeyvalue)
  {
    print "$key $inputkeyvalue{$key}\n"
  }

  print "\n";
}

So how does it work? First I change what perl thinks of as line end. This lives in a special variable "$/", so by setting

$/ = "<\/EmployeeProfile>";

I can read the whole of each employee record in one chunk. Then inside a loop that reads those chunks I use

(%inputkeyvalue) = m/<(\w+)>(.*)<\/\1>/g;

Now we will take that apart. m/........./ is the match operator. What it is looking for is a left angle bracket followed by one or more word characters \w+ followed by a right angle bracket. That in turn is followed by zero or more characters of anything except a newline .* followed by a left angle bracket, a slash, whatever we found on the first match \1 [grouped by the ()] and a trailing right angle bracket. The ()'s group things so you can refer to them later. So given some XML like

<sometag>some value</sometag>

(\w+) will be "sometag", (.*) will be "some value" and it will look for an end tag of "</sometag>"
The g on the end says do it globally, that is match as many times as possible. A by-product of this is that, in list context, it will return a list of whatever is in the () pairs for each match. By forcing that into a hash

(%inputkeyvalue) =

we end up with all the key/value pairs per employee record from the XML in a perl hash in one pass of the data. Running it against the XML above gives:


name Fred Flintstone
spouse Wilma
town Bedrock

name Barny Rubble
spouse Betty
town Bedrock
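If the jump from match to hash seems like magic, this sketch does it in two steps on a string held in memory, so no file is needed. The names $record, @list and %pairs are made up for the example.

```perl
# match.pl - a global match returns a flat list, which fills a hash

use strict;
use warnings;

my $record = "<name>Fred Flintstone</name>\n<town>Bedrock</town>\n";

# step 1: in list context m//g returns every captured group in order
my @list = $record =~ m/<(\w+)>(.*)<\/\1>/g;
print join(' | ', @list), "\n";

# step 2: a list assigned to a hash is taken as key, value, key, value
my %pairs = @list;

foreach my $key (sort keys %pairs)
{
  print "$key $pairs{$key}\n";
}
```

The one line version in the script simply skips the intermediate array.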

Now that was all I needed, but just in case someone asks "How would you sort that?" We need to get into "references". I don't want to get into that here in any detail, but simply put a reference can be thought of as a pointer to something rather than the something itself.



use strict;
use warnings;

my %inputkeyvalue;
my @emps; # an array to hold hashes

open EMPS, "<emp.xml" or die "Can't open input data $!\n";

$/ = "<\/EmployeeProfile>"; # change line ending

while ( <EMPS> )
{
  push @emps , {}; # put pointer to an anonymous hash on end of array
  (%{$emps[-1]})  = m/<(\w+)>(.*)<\/\1>/g; # fill that hash
  pop @emps if ! keys %{$emps[-1]}; # drop the hash just pushed if it is empty. ie last
}  
  
foreach $_ (sort by_name @emps)
{
  foreach my $key ( sort keys %{$_})
  {
    print "$key ${$_}{$key}\n";
  }
  print "\n";
}  

exit 0;

sub by_name
{
  ${$a}{'name'} cmp ${$b}{'name'};
}

Giving:


name Barny Rubble
spouse Betty
town Bedrock

name Fred Flintstone
spouse Wilma
town Bedrock

The second example was from my web site that contains a searchable archive of documents. After it went live some bright spark said it would be really cool if, having opened a document the search had turned up, the searched for text was highlighted. I suddenly realised that if the search results screen held a few extra pieces of information and if I used a button to view the document rather than a standard link I could do it. The essential bits of the perl cgi script that pressing the button invoked are shown below, where $ss contains the search string and NEWSLETTER is an open file handle pointing to the document.


undef $/; # This undefines the line ending so it will read the 
          # whole of the document in one go.
$_ = <NEWSLETTER>; # read the lot - the <> operator is i/o on handle
s/($ss)/\<font color="#ff0000"\>$1\<\/font\>/gis;

Taking it apart, s/....../....../ is the search and replace operator. If you don't tell it what to operate on it defaults to $_ (So does match BTW). So it finds the search string $ss and replaces that with itself $1, surrounded with font end-font tags (Thanks Brian!) to set the colour to red. It does this globally g, case insensitively i and treats the whole string as one line s. So in one pass we have highlighted in red every occurrence of the search string. Pretty neat huh?
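One thing to watch: $ss is used as a regular expression, so a search string containing characters such as + or ( means something special, or is not a valid pattern at all. The sketch below is a variation rather than the code from my site. It wraps the search string in \Q...\E so that it is matched literally. The sample text and search string are made up.

```perl
# highlight.pl - highlight a literal search string

use strict;
use warnings;

my $ss   = 'c++';    # hypothetical search string with regex metacharacters
my $text = "I like C++ and c++ alike.\n";

# \Q...\E quotes the metacharacters so "c++" is matched as typed
(my $marked = $text) =~ s/(\Q$ss\E)/<font color="#ff0000">$1<\/font>/gi;

print $marked;
```

Without the \Q...\E perl would reject "c++" as a pattern because of the doubled +.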

Taint mode

When you use perl for cgi scripts someone will try and hack your web server by breaking the scripts with bad data. Perl has a taint mode that is set by the -T command line switch and perl will then whinge about any unsafe use of data that has been obtained from the web source, amongst other places. You have to validate and "clean" such values using regular expressions before using them. An example might be an email address. Now it would be reasonable to assume that the email address be used in a reply. This could be done by running a system command like
`mail email-address message`
Perl would normally run this in a shell as the userid the web server was running as. Now consider if the "email address" consisted of the command line command separator character and then some nasty command like "rm -R /*". Ouch in spades. Now perhaps you see why taint mode is a Good Thing (tm).
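The usual way to "clean" a value is to match it against a pattern of what you are prepared to accept and use only what the () captured, as perl treats captured text as safe. Here is a sketch. clean_email is a made-up helper and the pattern is deliberately strict and simple, not a full definition of a valid email address. It runs with or without -T.

```perl
# untaint.pl - validate an email address before using it

use strict;
use warnings;

# clean_email is a hypothetical helper: it returns the address if it
# looks acceptable, or nothing (undef) if it does not
sub clean_email
{
  my ($input) = @_;
  return unless defined $input;

  # only word characters, dots, plus and minus signs around a single @
  if ( $input =~ m/\A([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)\z/ )
  {
    return $1;   # the captured text is what taint mode regards as clean
  }
  return;
}

foreach my $input ('fred@bedrock.example', 'fred@bedrock.example; rm -R /*')
{
  my $clean = clean_email($input);
  print defined $clean ? "ok: $clean\n" : "rejected: $input\n";
}
```

The point is to list what is allowed rather than try to think of everything that is not. Anything with a semicolon, space or slash in it never gets near the shell.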

Finally

And finally, how can anyone not like a language that boasts an unless statement?

Feedback to Dave Saville Last modified: Monday, 2 November 2009 19:32:22