Monday 24 September 2007

The Beast Remade: Do What I Mean

WWW::Mechanize is a subclass of LWP::UserAgent. LWP::UserAgent stops at the HTTP layer and leaves the application to deal with HTML, or whatever else the server sends, in its own way. WWW::Mechanize mixes in knowledge of HTML and checks your HTTP conversations for errors. (Now you know what the autocheck => 1 parameter in the "Enter WWW::Mechanize" episode below was for.) Using WWW::Mechanize therefore feels like scripted use of a browser, which fits my problem: I am trying, after all, to shrink the user interaction from several mouse clicks per suspected spam to a short command-line exchange.
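
To make the autocheck point concrete, here is a minimal sketch; the URL follows this series' hypothetical spam proxy:

use WWW::Mechanize;

# With autocheck => 1, any failed request makes the Mech
# object die, so the script cannot blunder on with an
# error page in hand.
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get("http://spamproxy:8888/no-such-page");   # 404: dies here
print "only reached on success\n";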

Intuitive methods like ->tick() for checkboxes, ->click() for buttons, and ->select() for drop-downs and listboxes help me fill out the necessary forms quickly. In my "rm" function, I deal with the spam like this:

foreach my $spamid (@spamid_arr)
{
    # Tick the checkbox whose value is this mail queue id.
    $localmech->form_name("quarantine");
    $localmech->tick("mailselected", $spamid);
}
# One click of the delete button disposes of the lot.
$localmech->click("delete");

From inspection, the relevant checkboxes had name="mailselected" and, as value, an id number that stayed constant even when the mail queue view changed - which makes sense, from a robustness perspective.
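
The ->select() method mentioned above is used the same way on listboxes. A hedged sketch, with the field name, its value and the button name invented for illustration:

# Hypothetical: pick a different quarantine view from a
# listbox, then refresh. Field and button names are made up.
$localmech->form_name("quarantine");
$localmech->select("queueview", "probable-spam");
$localmech->click("refresh");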

Monday 17 September 2007

The Beast Remade: Deep magic with Storable::dclone

The cat/head/rm/ok commands posed some difficulties. My naive "ls" implementation refreshed its view of the quarantine section every time you ran it, as a normal "ls" would.

The underlying web interface happened to add any new spam that arrived in the meantime to the top of the numbered list. So a certain spam might be renumbered every time my hack looked at the index page. I certainly did not want the numbering to refresh between calls to "ls"; that could cause a quick "ls; cat 1; rm 1" sequence to delete the wrong mail!

Thankfully, both the deletion form and the "view mail" links used the mail queue id as identifier, not the position in the 1-20 numbered list. Keeping a cached copy of the deletion form seemed wise, for later use in the "rm" command (a sketch of that bookkeeping follows below). But I also wanted to make further use of the logged-in WWW::Mechanize object, to get a look at each suspected spam, without risking issues like multiple logins kicking each other out.
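
A hedged sketch of that bookkeeping, using the HTML::Form methods (current_form, inputs, possible_values) that WWW::Mechanize builds on; the @spam_mailid array and the row-to-id mapping are my invention:

# At "ls" time: snapshot each checkbox value (a mail queue
# id) from the deletion form, numbered like the listing, so
# a later "rm 3" still maps number 3 to the right id even if
# the live queue has been renumbered in the meantime.
our @spam_mailid;

$main::mech->form_name("quarantine");
my $n = 1;
foreach my $input ($main::mech->current_form()->inputs)
{
    next unless defined $input->name
        and $input->name eq "mailselected";
    # possible_values on a checkbox is (undef, "on" value)
    $spam_mailid[$n++] = ($input->possible_values)[-1];
}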

Storable::dclone is deep Perl magic: it clones an instantiated object and all its references, internal state, code references and all. As such, it is not exactly lightweight or easy to wrap your head around, but it does exactly what I want in this case. It gives me a separate copy of the $mech WWW::Mechanize object to play around with, one that does not disturb the original:

use Storable qw(dclone);

# Storable refuses to serialize code references unless told
# otherwise, and the WWW::Mechanize object carries some
# around (its error handler, for one).
$Storable::Deparse = 1 unless defined $Storable::Deparse;
$Storable::Eval    = 1 unless defined $Storable::Eval;

my $localmech = dclone($main::mech);

... and then $localmech is mine to do with as I like. Below is a complete example of use: it fetches the suspected spam's detail page, extracts the message, pretty-prints it into a string, and returns that string.


use Storable qw(dclone);
use Encode qw(decode);
use HTML::TableExtract;
use Term::ANSIColor qw(color);

# Clone a $mech for viewing, keeping
# the old one around for the index page.
sub grabmsg
{
    my ($spamindex) = @_;

    # Let Storable serialize the code refs inside $mech.
    $Storable::Deparse = 1 unless defined $Storable::Deparse;
    $Storable::Eval    = 1 unless defined $Storable::Eval;
    my $localmech = dclone($main::mech);

    # Fetch this spam's detail page on the cloned session.
    # @spam_viewurl maps listing numbers to "view mail" URLs.
    $localmech->get($spam_viewurl[$spamindex]);
    my $content = decode("utf-8", $localmech->content);

    # The message lives in the second table at depth 1
    # (count is zero-based).
    my $extractor = HTML::TableExtract->new(
        depth => 1, count => 1);
    $extractor->parse($content);

    my $spam  = "";
    my $table = $extractor->first_table_found();
    my @headerlist = qw(Received: Return-Path: Date:
                        From: Reply-To: To: CC:
                        Subject: Attachments);

    # One header per row; the value sits in column 1.
    for (my $count = 0; $count < @headerlist; $count++)
    {
        if (defined (my $cell = $table->cell($count, 1)))
        {
            $spam .= color('BOLD') . $headerlist[$count]
                   . color('RESET') . "\n";
            $spam .= "$cell\n";
        }
    }
    $spam .= color('BOLD') . "Mail body:"
           . color('RESET') . "\n";
    $spam .= $table->cell(10, 0) . "\n";

    return $spam;
}

The Beast Remade: The lay of the land

HTML::TableExtract emerged as my poison of choice for the grunt work of HTML parsing. Of the three or four Perl modules I tried in the course of a frustrating morning, it was the easiest to improvise extraction criteria with.

The easiest way to pick out the table of interest from the quarantine main menu, for example, turned out to be:

use HTML::TableExtract;

# keep_html => 1 preserves the markup inside each cell,
# so links in the table survive for later fishing-out.
my $te = HTML::TableExtract->new(
    attribs => { width => 525 }, keep_html => 1 );
$te->parse($mech->content);

...since the attribute width=525 was that table's only distinguishing feature. Once it was parsed, however, it was the work of moments to build a "cd" command entering the quarantine section of interest. An "ls" command followed shortly after, with only momentary difficulties; the sketch below shows the idea.
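
Roughly how the "ls" itself can work, once the listing page has been parsed the same way (a separate HTML::TableExtract with keep_html => 1 over the listing): number the rows and remember each "view mail" URL. The column layout and the href regex are illustrative assumptions; only @spam_viewurl comes from the grabmsg code above:

# Hypothetical "ls" core: number the rows of the parsed
# quarantine listing and stash each row's "view mail" href
# for cat/head/rm to use later. $listing_te is an
# HTML::TableExtract parsed over the listing page; link in
# column 0 and subject in column 1 are invented.
our @spam_viewurl;

my $count = 1;
foreach my $row ($listing_te->first_table_found->rows)
{
    next unless defined $row->[0]
        and $row->[0] =~ /href="([^"]+)"/;
    $spam_viewurl[$count] = $1;
    (my $subject = $row->[1]) =~ s/<[^>]*>//g;   # strip tags
    printf "%3d %s\n", $count, $subject;
    $count++;
}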

Thursday 13 September 2007

The Beast Remade: Term::Shell Basics

Grabbing and parsing the web pages is all very well, but without a fast way to interact with them, it is all just wasted time and effort. Term::Shell by Neil Watkiss had been stuck at version 0.01 for half a decade (woo! recently updated to 0.02!) and barely rates a mention in a Google search. That does not deter a Perl-familiar sysadmin out to improve his daily routine; a non-interactive critical script would be another matter entirely.

Term::Shell provides excellent scaffolding and support when you do not want to muck around with Term::ReadLine yourself. As long as you have one of the Term::ReadLine:: alternatives installed, you get command-line editing and history for free.

A demonstration:

knan@viconia:~$ perl ./listing3.pl
World domination. But first, take care of the spam.
spsh:/> ls
No spam. Yet.
spsh:/> ?
Unknown command '?'; type 'help' for a list.
spsh:/> help
Type 'help command' for more detailed help.
Commands:
exit - exits the program
help - prints this screen, or help on 'command'
ls - Is there any spam? - no help available
spsh:/> exit

So far so good. Add a help_ls subroutine, along the lines sketched below, if you feel like explaining the intricacies of spam listing to your fellow spam handlers.
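
For instance, something like this; Term::Shell displays whatever the help_ routine returns when the user types "help ls" (the wording is mine):

# Detailed help for "ls", shown by "help ls".
sub help_ls
{
    return "ls fetches and lists the current spam quarantine view.\n";
}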

Source of the shell above:


use strict;
use warnings;

package main;

Spsh->new->cmdloop;

package Spsh;
use base qw(Term::Shell);

our $state_path;

sub preloop
{
    $state_path = "";
    binmode(STDOUT, ":utf8");

    print "World domination. But first, "
        . "take care of the spam.\n";
}

sub prompt_str
{
    return "spsh:/$state_path> ";
}

# Define an empty command, to avoid
# "unknown command" on an empty
# command-line.
sub run_
{
    return;
}

sub run_ls
{
    print "No spam. Yet.\n";
    return;
}

sub smry_ls
{
    return "Is there any spam?";
}

Tuesday 11 September 2007

The Beast Remade: Enter WWW::Mechanize

I had recently seen something called WWW::Mechanize mentioned in Google Hacks, 2nd edition. So I determined to have a play around with it, to see if there was something to be gained. An initial look seemed promising. In fairly short order, I had the web interface yelling at me to use a civilized browser, not this scary WWW-Mechanize thingy.

use WWW::Mechanize;

my $url = "http://spamproxy:8888/login.php";

$mech = WWW::Mechanize->new(autocheck => 1);

$mech->get($url);
print $mech->content;

A quick adjustment to $mech->agent took care of that ill temper. Adding calls to $mech->save_content() let me see what the interface expected of me - which was simple, at first. It just wanted a password in the single visible form element:

my $password = "foobarword";
$mech->submit_form(fields => { password => $password });

A $mech->save_content() later, I had won a small victory. The next target had to be the web application's menu structure. The story so far:

use WWW::Mechanize;

my $url = "http://spamproxy:8888/login.php";
my $password = "foobarword";

$mech = WWW::Mechanize->new(autocheck => 1);

# Yeah, sure, this is a tested and approved browser
$mech->agent ("Mozilla/5.0 (X11; U; Linux i686;"
."en-US; rv:1.7.13) Gecko/20060418 Firefox/1.0.8");

# Get login page
$mech->get($url);

# Try to log in
$mech->submit_form(fields => { password => $password });

# Save the response page for a look around
$mech->save_content("loggedin.html");

Starting out, I worried that extensive Javascript use would make this hard or impossible. As perldoc WWW::Mechanize::FAQ states, Javascript is not supported and not likely to be until someone looks seriously into it. It turned out that this particular web interface had followed web accessibility guidelines pretty well and was navigable without Javascript support. If your local friendly web interface is less accommodating, well, the FAQ has some tips.

The menu structure turned out to be a mixed bag. Link names were constructed by Javascript, so I could not rely on them. Link URLs, however, were static and thus available for use. A few calls to ->follow_link() later (see the sketch below), I had the quarantine main menu dumped to an HTML file for my perusal. This all looked very parsable and scriptable, and a sneak peek at the quarantine listings seemed to confirm my gut feeling.
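
Matching on URLs instead of link text is directly supported by ->follow_link(); a short sketch, with the URL fragment invented:

# Link text is Javascript-generated, so match the static
# URL instead. The /quarantine/ fragment is hypothetical.
$mech->follow_link(url_regex => qr/quarantine/);
$mech->save_content("quarantine_menu.html");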

The Beast Remade: Part the first

As web interfaces to various software have matured and multiplied, they have in many cases moved toward resembling traditional GUIs. Checkboxes, labeled buttons, client-side Javascript error checking and "do you really want to do this" dialogs point the way and do the required hand-holding for novice users. In many cases, this is a good thing - say, remote firmware updates or RAID configuration restores.

System administrators, however, are usually not novice users. Especially not when the web interface in question is the only available interface for a routine daily task. The task in question? To inspect a spam quarantine for false positives - a tiring chore at best. A somewhat cumbersome web interface slowed us down, so grumbles, less than complimentary epithets, and searches for a better way were inevitable.

The Powers That Be heard our grumbles and were sympathetic: this was not optimal, and we could use some time to find a faster way. Any tricks, however, were to go through The Official Interface, not delve into product internals in the layers below, due to support contract clauses. So we could not manipulate the mail queue directly, even though that was the most obvious path to speedups.

(scary cliffhanger - stay tuned for episode 2!)