
A Simple "word count" Agent

In this tutorial, we will write an agent to count words in a file and store the result in the database.

The following topics still need to be written up for this page:
  - General notes on creating an agent.
  - How the agent is run through the scheduler.
  - The agent's entry in scheduler.conf.
  - engine-shell, which wc_agent depends on.

Taken directly from the original README file written by nealk:

To start off, let's introduce four components common to all agents:
  - The agent itself. This performs the analysis and stores results in the 
database.  Agents can be built in any language – from shell script to C. 
  - The Scheduler.  Agents are executed through a scheduler.
  - The Interface.  The user interface (UI) or command-line interface (CLI)
schedules jobs (individual executions of the agent) via a //jobqueue// and
displays any results.
  - The jobqueue.  Every job must have an associated //jobqueue record//
containing agent-specific arguments.  The arguments can be either SQL
for the agent to execute or data entered by the user through the interface.
The jobqueue record is passed to the agent by the scheduler.

The jobqueue operates in two modes: generic and per-host.
The basic idea is that the file repository may be split across hosts.
Rather than transferring files across the network (e.g., NFS), it may be
faster for agents to run on the same host as the file.

For example, the wget_agent downloads a file from the Internet and stuffs
it into the repository.  Since the repository host is unknown, wget_agent
can really run on any host.  This is an example of a generic agent.

In contrast, the license analysis agents process files in the repository.
Since the hosts holding those files are known, it is faster to run these
agents on the specific host that stores each file.

There is one other distinction: the generic-host entries in the jobqueue
contain one request.  The value of the jobqueue.jq_args is passed as-is to
the agent and the agent is assumed to know how to parse the line.  In
contrast, the host-specific agents have an SQL line in the
jobqueue.jq_args.  The scheduler runs the SQL and sends the results of this
multi-SQL query (MSQ) to the agent.
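
For the host-specific case, the example script later on this page relies on
each returned column showing up as an ARG_-prefixed environment variable.
The sketch below illustrates that idea only.  It is not engine-shell's actual
code: the "column=value" input format, the function name export_row_as_args,
and the sample values are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: turn "column=value" pairs from one MSQ result row into
# ARG_-prefixed environment variables.  The input format is an assumption.

export_row_as_args() {
  local pair name value
  for pair in "$@"; do
    # Ignore anything that is not of the form name=value.
    case "$pair" in
      *=*) ;;
      *) continue ;;
    esac
    name=${pair%%=*}     # text before the first "="
    value=${pair#*=}     # text after the first "="
    # Accept only names that are safe to use in a shell variable name.
    case "$name" in
      ""|*[!A-Za-z0-9_]*) continue ;;
    esac
    export "ARG_${name}=${value}"
  done
}

# Example row with the two columns used later on this page (made-up values).
export_row_as_args "pfile=sha1value.md5value.1024" "pfile_fk=42"
echo "$ARG_pfile $ARG_pfile_fk"
```

A generic agent, by contrast, would receive the jq_args text unchanged and
would have to parse it itself.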

The difference between generic-host and MSQ is critical: if an agent needs
to perform a task on hundreds of DB items, then it either needs to run
the SQL query itself (using parameters from the jq_args), or it needs to
process one item at a time, with each item retrieved by the scheduler using
the MSQ.

Since this example wc agent is expected to run on thousands of files in the
repository, it is a good idea to use the host-specific, MSQ option.

With MSQ queries, we need to know the data and the stop condition.  The
stop condition identifies when a file has been processed.  In this
example, there is a custom table, "agent_wc", for storing results.  The SQL
for the jq_args should return the pfile_fk and repository file name of
every file that is associated with the project and does not already exist
in the agent_wc table:
  SELECT pfile_sha1 || '.' || pfile_md5 || '.' || pfile_size AS pfile, pfile_fk
  FROM uptreeup
  WHERE upload_fk = 619
  AND pfile_fk NOT IN (SELECT agent_wc.pfile_fk FROM agent_wc)
  LIMIT 5000;

The "LIMIT 5000" ensures that this job does not hog all of the scheduler's
resources.
The "619" is an example -- it should match the upload_fk for the project
and be set by the Interface.
Once every file has been processed, the query returns no rows.  That's how
the scheduler will know that there is no more work to perform.
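
The stop condition can be pictured with a small simulation.  This is only a
sketch of the idea, not the scheduler's code: plain shell variables stand in
for the database, all_pfiles plays the role of the upload's files, done_list
plays the role of agent_wc, and limit stands in for "LIMIT 5000".  All of
these names are made up for the illustration.

```shell
#!/bin/bash
# Simulation only: shell variables stand in for the database tables.

all_pfiles="101 102 103 104 105 106 107"   # files belonging to the upload
done_list=" "                              # pfile_fk values with stored results
limit=3                                    # stands in for LIMIT 5000
batches=0
processed=0

# Emulate the query: unprocessed pfile_fk values, at most $limit of them.
pending_batch() {
  local pf count=0
  for pf in $all_pfiles; do
    # Skip files that already have a result (the NOT IN clause).
    case "$done_list" in *" $pf "*) continue ;; esac
    [ "$count" -ge "$limit" ] && break
    echo "$pf"
    count=$((count + 1))
  done
}

while true; do
  rows=$(pending_batch)
  # Stop condition: the query returned no rows.
  if [ -z "$rows" ]; then
    break
  fi
  batches=$((batches + 1))
  for pf in $rows; do
    done_list="$done_list$pf "             # the agent "stores a result"
    processed=$((processed + 1))
  done
done

echo "finished after $batches batches, $processed files processed"
```

Each pass handles at most limit files, and the loop ends only because every
processed file drops out of the next query result.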

Since this job should run on the host that holds each file,
jobqueue.jq_runonpfile should be set to "pfile".  This is the name of the
column from the SQL that denotes the host-specific information.
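
Putting the pieces together, the Interface has to supply two values for the
job: the SQL in jq_args and "pfile" in jq_runonpfile.  The sketch below
builds both for a given upload.  The helper name build_wc_jq_args is
hypothetical and not part of FOSSology, and how the values are written into
the jobqueue is not shown.

```shell
#!/bin/bash
# Sketch only: build the two job values described above for one upload.
# build_wc_jq_args is a hypothetical helper, not part of FOSSology.

build_wc_jq_args() {
  local upload_id=$1
  # Accept digits only, so nothing else can end up inside the SQL text.
  case "$upload_id" in
    ""|*[!0-9]*) echo "upload id must be a number" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT pfile_sha1 || '.' || pfile_md5 || '.' || pfile_size AS pfile, pfile_fk
FROM uptreeup
WHERE upload_fk = $upload_id
AND pfile_fk NOT IN (SELECT agent_wc.pfile_fk FROM agent_wc)
LIMIT 5000;
EOF
}

jq_runonpfile="pfile"            # column that carries the host-specific data
jq_args=$(build_wc_jq_args 619) || exit 1

echo "jq_runonpfile: $jq_runonpfile"
echo "jq_args:"
echo "$jq_args"
```

Restricting the upload id to digits keeps user input from changing the shape
of the query.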

#!/bin/bash
# Example wc agent, written in shell script.
# This should be used with engine-shell.
#
# Copyright (C) 2007 Hewlett-Packard Development Company, L.P.
# 
#  This program is free software; you can redistribute it and/or
#  modify it under the terms of the GNU General Public License
#  version 2 as published by the Free Software Foundation.
#  
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#  
#  You should have received a copy of the GNU General Public License along
#  with this program; if not, write to the Free Software Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.

# Set the path.
# If the paths in Makefile.conf change, then these will need to change.
export PATH=/usr/bin:/usr/local/fossology:/usr/local/fossology/agents:/usr/local/fossology/test.d

# This agent should appear in the scheduler.conf as:
# agent=wc | /usr/local/fossology/agents/engine-shell wc_agent '/usr/local/fossology/agents/wc_agent'

# engine-shell will convert all of the SQL columns into environment
# variables.  The MSQ will return pfile=... and pfile_fk=...
# These will become $ARG_pfile and $ARG_pfile_fk.

# Both variables are required; stop with a non-zero status if either is missing.
if [ -z "$ARG_pfile" ] ; then
  echo "FATAL: \$ARG_pfile not set. Aborting."
  exit 1
fi
if [ -z "$ARG_pfile_fk" ] ; then
  echo "FATAL: \$ARG_pfile_fk not set. Aborting."
  exit 1
fi

# Get the path to the actual file
RepFile=$(reppath files "$ARG_pfile")

# Get the word-count values and insert them into the database using dbinit.
wc "$RepFile" 2>/dev/null | while read Lines Words Bytes Name ; do
  # Convert wc to an SQL statement
  echo "!INSERT INTO agent_wc (pfile_fk,wc_words,wc_lines) VALUES ($ARG_pfile_fk,$Words,$Lines);"
  # The initial "!" tells dbinit to ignore insert failures.
  # Don't worry about checking if the value exists... If it did exist, then
  # the MSQ would have never called this program.
  # And if two agents happen to run on the same data, then the DB constraint
  # for unique values will prevent duplicates.
done | dbinit -

exit 0;  # done successfully
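
To try the core of the script without a scheduler or database, the same
steps can be run against stand-ins.  This is a sketch under stated
assumptions: the reppath and dbinit functions below are stubs written for
this example, not the real FOSSology tools, and the ARG_ values are made up.
For the two-line sample file it prints an INSERT with 5 words and 2 lines.

```shell
#!/bin/bash
# Dry run of the agent's core steps with stand-ins for the FOSSology tools.
# The reppath and dbinit functions below are stubs written for this sketch.

sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"

reppath() { echo "$sample"; }   # stub: always resolves to the sample file
dbinit()  { cat; }              # stub: print the SQL instead of running it

# Values that engine-shell would normally provide (made up here).
ARG_pfile="sha1value.md5value.24"
ARG_pfile_fk=42

RepFile=$(reppath files "$ARG_pfile")

# Same pipeline as the agent: wc output becomes one INSERT statement.
sql=$(wc "$RepFile" 2>/dev/null | while read -r Lines Words Bytes Name ; do
  echo "!INSERT INTO agent_wc (pfile_fk,wc_words,wc_lines) VALUES ($ARG_pfile_fk,$Words,$Lines);"
done | dbinit -)

echo "$sql"
rm -f "$sample"
```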
 
1.0.0/wc_agent.txt · Last modified: 2009/04/16 12:42 (external edit)

Copyright (C) 2007-2009 Hewlett-Packard Development Company, L.P.
FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2