
Sentence Classifier License Agent

Here is an overview of the new sentence classifier license agent and the other work Adam is doing. The agent is currently called F1 but will be renamed when it moves out of projects and into the release tree. Adam developed a proof of concept for a sentence-based license detector in Python (f1.py). We have tested it enough to know that it is significantly faster than the current bSAM license agent, more flexible (we won't need licterms or 1sl), and accurate. To make it into a production module, we need to optimize its speed and make the code production ready, so Adam has been rewriting everything in C or C++. The Maximum Entropy Model library is libmaxent, from http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html. It is written in C++, which is why this agent is in C++. libmaxent is licensed under the LGPL.

Just as bSAM uses License.bsam to analyze a file under test, F1 uses a database of reference license sentences. Each file to analyze is broken into sentences, and each sentence is matched against the sentences in the reference licenses. Based on how many sentences match, how many sentences are in the reference license, and the probability of each sentence match, F1 assigns a score denoting how good the match is. That description is simplified and leaves out the optimizations: in reality, not every sentence is compared. F1 uses KD-trees and leader-follower strategies to quickly find the most similar sentences in the database to compare with, which significantly reduces the number of sentence comparisons.
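As a rough illustration of the scoring idea, here is a minimal Python sketch that combines per-sentence match probabilities into one score per reference license. The exact formula F1 uses is not described on this page; the averaging rule, the threshold parameter, and the name score_license are all assumptions made for illustration.

```python
def score_license(match_probs, num_reference_sentences, threshold=0.5):
    """Combine sentence matches into a single score between 0 and 1.

    match_probs: best match probability for each reference sentence that
                 found a candidate in the file under test, at most one
                 entry per reference sentence (hypothetical input).
    num_reference_sentences: total sentences in the reference license.
    threshold: assumed cutoff below which a match is ignored.
    """
    if num_reference_sentences <= 0:
        raise ValueError("reference license must contain at least one sentence")
    # Keep only the matches that are confident enough to count.
    accepted = [p for p in match_probs if p >= threshold]
    # Dividing by the reference size means a license only scores high when
    # most of its sentences are matched with high probability.
    return sum(accepted) / num_reference_sentences
```

With this rule, two perfect matches against a four-sentence reference license score 0.5, so a file that contains only part of a license is ranked below one that contains all of it.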

Here is how this flows:

1. Create a sentence model (train_sentence_model)

In order to analyze a file based on sentences, F1 needs to be able to recognize a sentence. To build the sentence model we need training data, which is made by taking a file and tagging each sentence. For example:

<SENTENCE>                    GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007</SENTENCE>
<SENTENCE>Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
</SENTENCE>
<SENTENCE>Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.</SENTENCE>
<SENTENCE>TERMS AND CONDITIONS</SENTENCE>

Initially, this has to be done manually. You will find this and other training sets in fossology/trunk/project/adam/Training.
The program trunk/project/adam/sentence_base_classification/c/train_sent_model uses this training set to create SentenceModel.dat, the actual model used by F1:

    train_sent_model -f <trainingdata> -o <sentence_model>

Here the -f parameter points to a text file that lists the training files, one absolute path per line. The -o parameter specifies where the binary model will be written.

train_sent_model.cpp has been written and needs some non-functional code cleanup. It is only 49 lines long because it relies on libmaxent.
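To make the tagged training format concrete, the following Python sketch pulls the sentences back out of a hand-tagged file. It is not taken from train_sent_model.cpp; the function name read_tagged_sentences and the choice to collapse whitespace inside a sentence are assumptions.

```python
import re

# Non-greedy match across newlines, since a tagged sentence may span lines.
SENTENCE_RE = re.compile(r"<SENTENCE>(.*?)</SENTENCE>", re.DOTALL)

def read_tagged_sentences(text):
    """Return the sentences found in hand-tagged training text."""
    sentences = []
    for raw in SENTENCE_RE.findall(text):
        # Assumption: runs of whitespace (including newlines) are not
        # significant, so collapse them to single spaces.
        sentences.append(" ".join(raw.split()))
    return sentences

sample = """<SENTENCE>                    GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007</SENTENCE>
<SENTENCE>TERMS AND CONDITIONS</SENTENCE>"""

for sentence in read_tagged_sentences(sample):
    print(sentence)
```

Run on the two tagged sentences in sample, this prints "GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007" and "TERMS AND CONDITIONS" on separate lines.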

2. New training data (label_sentences)

Once this initial sentence model, based on hand-labeled sentences, was created, Adam used the maximum entropy model to label sentences automatically:

    label_sentences -m <sentence_model> <file>

The program label_sentences uses <sentence_model> to write a new .sent file to stdout. The output should be manually checked and then added to the sentence training files (fossology/trunk/project/adam/Training/Sentences). label_sentences speeds up creating new training data. It can also be used to verify that the current model is working correctly.

label_sentences.cpp is finished (less some non-functional code cleanup).
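One way to use the output for verification is to compare the sentences the model produced with a hand-checked version of the same file. The sketch below is only an illustration of that idea; the function name sentence_agreement and the use of exact string matching are assumptions, and nothing like it is known to exist in project/adam.

```python
def sentence_agreement(predicted, reference):
    """Return (precision, recall) of predicted sentences against a checked reference.

    Both arguments are lists of sentence strings. A predicted sentence counts
    as correct only if the identical string appears in the reference.
    """
    predicted_set = set(predicted)
    reference_set = set(reference)
    common = predicted_set & reference_set
    # Precision: share of predicted sentences that are correct.
    precision = len(common) / len(predicted_set) if predicted_set else 0.0
    # Recall: share of reference sentences that the model found.
    recall = len(common) / len(reference_set) if reference_set else 0.0
    return precision, recall
```

Low precision suggests the model is splitting or merging sentences incorrectly; low recall suggests it is missing boundaries that a person would mark.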

3. Create the reference license database (database)

This is the database of all the reference license sentences. The maximum entropy model created by train_sent_model is used to break the reference licenses into sentences. Each sentence is converted into a text frequency vector (tfv). The tfvs are then normalized to speed up later calculations. The tfvs and sentences are saved to a model file for use in the F1 algorithm.

    database -f <trainingdata> -m <sentence_model> -o <database_model>

The reference licenses are passed to the algorithm using the -f parameter. Each reference license should be placed on its own line in the trainingdata file. A sentence model must be specified using the -m parameter. The -o parameter allows one to specify the name of the binary model produced.
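The following Python sketch shows one plausible reading of the vector step: count the words in a sentence, scale the counts to unit length, and compare two sentences with a dot product. The tokenizer, the unit-length normalization, and the use of cosine similarity are assumptions; this page does not say which normalization or distance the database program actually uses.

```python
import math
import re
from collections import Counter

def term_vector(sentence):
    """Count word occurrences, then scale the vector to unit length."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    counts = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}

def similarity(vec_a, vec_b):
    """Cosine similarity; for unit-length vectors this is just a dot product."""
    # Iterate over the shorter vector to do fewer lookups.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
```

Normalizing once when the database is built means each later comparison needs no square roots or divisions, which is the kind of saving the normalization step is meant to provide.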

4. Create the agent

The functional code is f1.py. There is a rudimentary, incomplete agent version called agent.py. It needs to be rewritten in C and extended to make it a fully functional agent. The agent requires the reference license database and the sentence model from the previous steps. Provided that http://fossology.org/writing_an_agent and the two example agents slated for it are done, the rewrite into a complete agent should only take a week or so. The estimated completion date is Aug 24.

5. Create the UI

This has not been started.

6. Integration into trunk

None of the code has been moved from project/adam to fossology/agents. The makefile hasn't been updated to fit into fossology/agents. Dependencies haven't been added to packages. No docs (except sparse READMEs) have been written.

Copyright agent

Following F1, Adam will start on training data for the copyright agent. Similar to the license agent, the copyright agent will use a maximum entropy model to learn how copyrights are constructed.

License_test

Adam has started on a program that will do a quick scan to see if a file contains any license. FYI, Nomos (not fo_nomos) already has this built in. This is an optimization so that the sentence classifier only has to operate within a range of bytes that might contain a license. A proof of concept has not yet been written, so it is unclear whether this will be a worthwhile optimization. That's why it is number 8 on this list (which is in priority order).
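Since no proof of concept exists yet, the following Python sketch is purely hypothetical. It shows one simple way such a pre-scan could work: look for a few trigger words and return the byte ranges around them. The keyword list, the window size, and the name candidate_ranges are all invented for illustration.

```python
import re

# Hypothetical trigger words; a real list would need tuning.
KEYWORDS = re.compile(rb"licen[sc]e|copyright|warranty|redistribut", re.IGNORECASE)

def candidate_ranges(data, window=512):
    """Return merged (start, end) byte ranges around each keyword hit."""
    ranges = []
    for match in KEYWORDS.finditer(data):
        start = max(0, match.start() - window)
        end = min(len(data), match.end() + window)
        if ranges and start <= ranges[-1][1]:
            # Overlaps the previous range, so extend it instead of adding a new one.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

A file with no hits returns an empty list and could skip the sentence classifier entirely, which is where any speedup would come from.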

 
f1.txt · Last modified: 2009/08/22 13:52 by adamb

Copyright (C) 2007-2009 Hewlett-Packard Development Company, L.P.
FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2