Non-statistical Language-Blind Morpheme Extraction


This software was designed by Zachary Bornheimer to perform automated morpheme extraction. The goal is an unsupervised, non-statistical, language-blind machine learning algorithm that can parse corpora in a variety of languages.

The paper explaining the research is coming soon.

Here is the README file.

Morpheme Extraction System

This software allows for the programmatic 
extraction of morpheme candidates from a 
corpus into a defined morpheme-list location.

Licensed under the GPLv2.

If you change something or get something to 
work better, please let me know. It will help
me improve in C and will help the project :-)

The research paper that accompanies this project is coming soon.

Software Required for Functionality:
    gcc (with OpenMP compatibility enabled)

How to Install:
Choose one of the following:
    make optimized
    make debug
    make all
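
Because the build requires gcc with OpenMP enabled, it can be handy to
confirm that your compiler setup supports it before running make. The
sketch below is not part of this project; the function name
openmp_max_threads is hypothetical. It relies only on the standard
_OPENMP macro and the omp_get_max_threads() call from omp.h.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only available when OpenMP is enabled */
#endif

/* Hypothetical helper, not from the project.
   Returns the maximum number of OpenMP threads, or 0 when the
   file was compiled without OpenMP support. */
int openmp_max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 0;
#endif
}
```

Compile a file containing this function with gcc -fopenmp and call it
from main. A return value of 0 means OpenMP was not enabled at compile
time.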

Command-line Arguments:

Verbose Mode:      --verbose
Serial Processing: --serial or --sequential or --process-sequentially
Full Processing:   --process
Output File:       --output-file REL-FILE-PATH
Corpus Dir:        --corpus-dir  REL-CORPUS-PATH

where REL-FILE-PATH is a relative path to the desired output file and
REL-CORPUS-PATH is a relative path to the corpus directory.
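
For readers who want to see how flags like these can be handled in C,
here is a minimal sketch using getopt_long from getopt.h. It is not the
project's actual parsing code. The struct settings type, its field
names, and the parse_args function are all assumptions made for
illustration. It also assumes the three serial flags are
interchangeable aliases.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical settings struct; the field names are not from the project. */
struct settings {
    bool verbose;
    bool serial;
    bool full;
    const char *output_file;  /* NULL when --output-file is absent */
    const char *corpus_dir;   /* NULL when --corpus-dir is absent */
};

/* Parse the documented flags into *s.
   Returns 0 on success, -1 on an unknown option or a missing argument. */
int parse_args(int argc, char **argv, struct settings *s)
{
    static const struct option long_opts[] = {
        {"verbose",              no_argument,       NULL, 'v'},
        {"serial",               no_argument,       NULL, 's'},
        {"sequential",           no_argument,       NULL, 's'},
        {"process-sequentially", no_argument,       NULL, 's'},
        {"process",              no_argument,       NULL, 'p'},
        {"output-file",          required_argument, NULL, 'o'},
        {"corpus-dir",           required_argument, NULL, 'c'},
        {NULL, 0, NULL, 0}
    };
    int opt;

    *s = (struct settings){0};
    optind = 1;  /* getopt keeps global state; start at the first argument */

    /* An empty short-option string means only long options are accepted. */
    while ((opt = getopt_long(argc, argv, "", long_opts, NULL)) != -1) {
        switch (opt) {
        case 'v': s->verbose = true;      break;
        case 's': s->serial = true;       break;
        case 'p': s->full = true;         break;
        case 'o': s->output_file = optarg; break;
        case 'c': s->corpus_dir = optarg;  break;
        default:  return -1;  /* getopt_long already printed a message */
        }
    }
    return 0;
}
```

A main function would call parse_args(argc, argv, &s) and exit with an
error when it returns -1.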

Verbose Mode prints more output, but it slows processing.

Serial Processing yields separate results for each file processed,
    as opposed to one combined result for the whole corpus :)

Full Processing yields both per-file and combined results, as if
    you had run the program once with --serial and then a second
    time without that flag.

Output File is the file to which results are appended
    (it won't overwrite existing data).
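
The append behavior described here matches what the C standard library
offers through fopen with mode "a". The sketch below shows that
technique only; it is not taken from the project, and the function name
append_result is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper, not from the project.
   Append one line to the file at `path`, creating the file if it does
   not exist. Existing contents are left untouched.
   Returns 0 on success, -1 on failure. */
int append_result(const char *path, const char *line)
{
    FILE *fp = fopen(path, "a");  /* "a" always writes at end of file */
    if (fp == NULL)
        return -1;

    int ok = fprintf(fp, "%s\n", line) >= 0;
    if (fclose(fp) != 0)          /* fclose flushes; a failure loses data */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling append_result twice with the same path leaves both lines in the
file, in the order they were written.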

Corpus Dir is the place where all the files that need to be
    processed reside.
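
As an illustration of walking a corpus directory, here is a minimal
sketch using the POSIX opendir, readdir, and stat calls. It is not the
project's own code. The function name count_corpus_files is
hypothetical, and counting only regular files directly inside the
directory (no recursion) is an assumption.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper, not from the project.
   Count the regular files directly inside `dir` (subdirectories are
   not descended into). Returns the count, or -1 if the directory
   cannot be opened. */
int count_corpus_files(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        char path[4096];
        struct stat st;

        /* Build "dir/name"; skip entries whose path would not fit. */
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;

        /* "." and ".." are directories, so S_ISREG filters them out. */
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode))
            count++;
    }
    closedir(d);
    return count;
}
```

A caller could pass the value given to --corpus-dir and report an error
when the result is -1.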