From bryce@osdl.org Fri Sep 24 14:46:58 2004
Date: Wed, 14 Jul 2004 12:15:37 -0700 (PDT)
From: Bryce Harrington <bryce@osdl.org>
To: stp-devel@lists.sourceforge.net
Subject: [Stp-devel] Conceptual Proposal for Multi-Node Testing (plus some)


This is a 'kill two birds with one stone' idea (actually *three*
birds...): multi-node testing, plus a couple of other capabilities we
need, can all be solved with one system.


Problem Statement
=================
1.  During the recent investigation of the LTP issue, we found it
extremely convenient for the engineers to hop directly onto an STP box
when they found they could not replicate the issue quickly on their dev
box.  Within an hour of poking around on the STP box they found a way to
replicate it, and returned the STP system to the queue.

Putting this machine into maintenance mode for an hour probably saved
several hours of effort.  I think it is highly likely we will find
similar benefits in the future in doing this.

However, the process for making this happen was not as smooth as it
could be:

   - System was not configured for the test in question
   - Required admin involvement; would save time to automate

2.  For LTP, a new test is released each month.  I fortunately have a
machine I can compile and test out LTP on, but:

   - Project machines are not configured like STP machines, so
     replicating problems seen in STP can be time consuming

Being able to 'check out' an STP machine for a short period for me to
log in and investigate LTP compile/run issues would save that setup
effort.  This is probably true for other test authors as well.

3.  For network tests, we generally need to have multiple machines
assigned to that test.  For example, many require a server and one or
more clients.

   - We currently don't have a solid concept for how to check out
     multiple machines for a given test to use in a 'multi-node'
     fashion


Concept
=======
Here is an idea for a 'machine reservation system' that addresses the
needs of all three cases above, in a (hopefully) straightforward
fashion:

   * A SOAP call is added to the STP framework that requests a machine
     be 'checked out'.  The caller can specify some selection criteria,
     such as that it be 'any 4-way', or 'stp1-003', or some setup
     requirements such as 'RH9, with the tiobench test suite set up to
     run with options foo and bar'.  The user can also specify an email
     address to notify when the system becomes available.
     Alternatively, they can leave the email blank and use a second SOAP
     call to poll the checkout status.  We can provide access to these
     SOAP calls via a cmdline script that users can run from their local
     machines.

   * When a machine is checked out, it is put on a time-out.  After the
     time has expired, the machine will automatically return to the
     queue.  This way if someone checks out a machine but isn't around
     to use it when it becomes available, it won't sit idly checked out
     forever.

   * A SOAP call is added to permit reserving the checked-out machine.
     This allows the user to request to postpone or cancel the above
     timeout, so that they won't get surprised with the machine
     returning to the queue all of a sudden while they're working on it.
     They can either specify a period of time to allocate (e.g., 120
     min), or a cut-off time (6:00 pm Friday), or cancel it entirely.
     The maximum amount of time allowed for a given user is controllable
     at the administrative level, so we can ensure no one user
     monopolizes machine time (unless we authorize them to).  A script
     for executing this function would be placed in the /root dir of the
     machine in question, for them to run as soon as they've logged in.

   * During normal STP execution, tests are always assigned 1 machine
     and the framework installs and executes the test there, just as we
     currently do.

   * For multi-node tests, though, the process occurs as normal, but
     within the 'master' host's wrap.sh script, it also requests one or
     more client machines, via the aforementioned check-out SOAP call.
     In this call, the master will indicate what setup work the
     framework should do.  The master then uses the polling SOAP call to
     learn when its clients become ready to use.

   * This gives the master some options.  It can start running with one
     client immediately on its availability, bringing others online as
     they become available, or wait until all are ready.  For instance,
     if it needs to test on 1-ways, 2-ways, and an 8-way, but the 1-ways
     are in use, it could start with the others first.  The master could
     also perform additional custom setup work on the clients beyond
     what the STP framework does, if needed.  Further, the master can
     dismiss its clients immediately as soon as it no longer needs them,
     then perform report generation, result upload, etc. subsequently.

   * As the network tests progress, the master will monitor and adjust
     the reservation times of clients as needed.  The master would never
     do an indefinite reservation on a client.  This way, if things go
     horribly wrong for the master, everything will eventually just time
     out, and all of the clients will be returned to the pool normally.
     Of course, we could add a proactive check as well, that if the
     master stops responding to queries, we kill it and all of its
     children, in one fell swoop.
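
To make the check-out / time-out / reserve behavior above a bit more
concrete, here is a rough sketch in Python.  It is only an illustration
under assumptions: the class and method names, the in-memory pool, the
abstract clock in minutes, and the two constants are all hypothetical,
and none of this is existing STP code or the actual SOAP interface.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values; the proposal only says limits are admin-controllable.
DEFAULT_TIMEOUT_MIN = 30
MAX_RESERVE_MIN = 240


@dataclass
class Checkout:
    host: str
    user: str
    expires_at: Optional[float]  # minutes on an abstract clock; None = no timeout


class ReservationPool:
    """In-memory stand-in for the framework's machine queue (an assumption)."""

    def __init__(self, hosts):
        self.free = list(hosts)
        self.checked_out = {}

    def check_out(self, user, now, host=None):
        """Hand out a named host, or any free one, with the default timeout."""
        if host is None:
            if not self.free:
                return None
            host = self.free[0]
        elif host not in self.free:
            return None  # busy; a caller would poll again later
        self.free.remove(host)
        co = Checkout(host, user, now + DEFAULT_TIMEOUT_MIN)
        self.checked_out[host] = co
        return co

    def reserve(self, host, now, minutes=None, allow_indefinite=False):
        """Postpone the timeout, capped unless an admin authorized more."""
        co = self.checked_out[host]
        if minutes is None:
            if not allow_indefinite:
                raise PermissionError("indefinite reservation not authorized")
            co.expires_at = None
        else:
            co.expires_at = now + min(minutes, MAX_RESERVE_MIN)
        return co

    def release(self, host):
        """Return a machine to the queue."""
        del self.checked_out[host]
        self.free.append(host)

    def expire(self, now):
        """Return every machine whose timeout has passed to the queue."""
        expired = [h for h, co in self.checked_out.items()
                   if co.expires_at is not None and co.expires_at <= now]
        for h in expired:
            self.release(h)
        return expired
```

In this sketch a master would call reserve() with a finite number of
minutes each time, never the indefinite form, so that expire() always
reclaims its clients eventually if the master dies.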

One of the nice things about this approach is that very little change to
the STP framework is needed (other than the new SOAP calls)
to enable it.  Further, it gives the test authors a great deal of
flexibility in determining how things should work, and allows them to
alter the behavior without needing any mods to STP itself.  Since it is
controlled via wrap.sh, this provides a straightforward tie-in with the
Test Options functionality, thus requiring no changes to the STP test
request form; the wrap.sh for the network test could simply have a
--num_clients option, or whatever, for instance.
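
As a rough illustration of the master-side logic described above, here
is a sketch in Python (standing in for what wrap.sh would do in shell).
FakeFramework is a made-up stand-in for the proposed check-out and
polling SOAP calls; its method names, the --wait_for_all flag, and the
selection/setup strings passed to it are assumptions for the example,
not an existing STP interface.

```python
import argparse


class FakeFramework:
    """Stand-in for the proposed SOAP calls; not a real STP API."""

    def __init__(self, ready_after):
        # ready_after: request id -> number of polls before it reports ready
        self._ready_after = dict(ready_after)
        self._next_id = 0

    def check_out(self, criteria, setup):
        """Request a machine; return an id to poll with."""
        req_id = self._next_id
        self._next_id += 1
        self._ready_after.setdefault(req_id, 0)
        return req_id

    def poll(self, req_id):
        """Return True once the requested machine is set up and available."""
        if self._ready_after[req_id] > 0:
            self._ready_after[req_id] -= 1
            return False
        return True


def parse_args(argv):
    parser = argparse.ArgumentParser(description="network test wrapper (sketch)")
    parser.add_argument("--num_clients", type=int, default=1)
    parser.add_argument("--wait_for_all", action="store_true")
    return parser.parse_args(argv)


def run_master(framework, num_clients, wait_for_all=False, max_polls=100):
    """Request clients, then start each as it becomes ready (or all at once).

    Returns (started, pending): started is a list of (request id, poll round)
    pairs, pending is whatever never became ready within max_polls.
    """
    pending = [framework.check_out("any 4-way", "RH9") for _ in range(num_clients)]
    ready, started = [], []
    for round_no in range(1, max_polls + 1):
        for req in list(pending):
            if framework.poll(req):
                pending.remove(req)
                ready.append(req)
        if not wait_for_all or not pending:
            # A real master would launch the client-side test work here.
            started.extend((req, round_no) for req in ready)
            ready = []
        if not pending:
            break
    return started, pending
```

With two clients where the first needs two extra polls, the default
mode starts the second client in round 1 and the first in round 3,
while --wait_for_all holds both until round 3.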

Here are a couple of advanced use cases:

A) The user wants full control over the master, but wants STP to take
   care of the dirty work of scraping dead clients off the floor and
   reinitting them.  So each week he checks out the master machine and
   reserves it until 6pm Friday.  He logs in and runs tests as he
   wishes, restarting them manually as needed, etc.  STP maintains the
   client machines for him, so his wrap.sh just checks them out
   as-needed, and when he's not running tests, they automatically return
   to the pool to do other work, leaving his master server untouched in
   the meantime.

B) The user is load-testing different NFS servers, and just wants to
   quickly pig-pile the NFS servers with a ton of client load.  External
   to STP she's built a collection of client machines that perform
   various server requests, activated by a trigger.  She uses STP to set
   up and invoke the master machine, which when ready sends the clients
   a message to 'Bring it on!'  The user customizes the way the clients
   attack the master server as needed; for instance, she can start with
   1 client operating for a while, while the other clients do stuff for
   other master servers, gradually ganging them all up on one to
   complete the big stress tests.  Using STP, she then queues up tests
   for each of the NFS servers she's comparing on 2, 4, and 8-way
   machines, and lets everything run over the weekend.


Conclusion
==========
By implementing a machine check-out and reservation system, this
approach provides an adequate solution to our network test needs in a
(hopefully) quick-and-easy-to-implement fashion, and gives us
flexibility for a huge variety of network testing scenarios.

We've been finding SOAP a handy and relatively straightforward way of
communicating between PLM and STP so far, so we can feel fairly
confident that building the system around SOAP calls will work well.
Because SOAP is a standardized network RPC mechanism, it also fits in
with long-range goals of being able to interoperate with off-site
machinery, so expanding our use and experience-base with SOAP will make
it easier to figure out how to tie in external machinery with STP, if
and when we need to do so.

Further, in addition to providing a scheme for doing multi-node testing,
this approach adds very useful time-saving capabilities for developers
(in case #1) and test authors (in case #2), which will make issues easier
to troubleshoot and address quickly.  Even if we never used the
multi-node capability, the payoff of this benefit alone could make it
worth it.





