GlossAPI

A library for processing academic texts in Greek and other languages, developed by ΕΕΛΛΑΚ.

Features

PDF Processing: Extract text content from academic PDFs with structure preservation
Quality Control: Filter and cluster documents based on extraction quality
Section Extraction: Identify and extract academic sections from documents
Section Classification: Classify sections using machine learning models
Greek Language Support: Specialized processing for Greek academic texts
Metadata Handling: Process academic texts with accompanying metadata
Customizable Annotation: Map section titles to standardized categories

Installation

pip install glossapi

Usage

The recommended way to use GlossAPI is through the Corpus class, which provides a complete pipeline for processing academic documents:

from glossapi import Corpus
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    # metadata_path="/path/to/metadata.parquet",  # Optional
    # annotation_mapping={
    #     # Maps a label from the metadata's document_type column to the
    #     # text type it should be annotated as ('chapter' or 'text' for now)
    #     'Κεφάλαιο': 'chapter',  # 'Κεφάλαιο' is Greek for 'chapter'
    #     # Add more mappings as needed
    # }
)

# Step 1: Extract documents (with quality control)
corpus.extract()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
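
The three steps above depend on one another in order, so a small wrapper that stops at the first failure can be handy. The following is a minimal sketch, not part of GlossAPI: `run_pipeline` and `PipelineLike` are hypothetical names introduced here, and the only assumption about the corpus object is that it exposes the `extract()`, `section()`, and `annotate()` methods shown above.

```python
import logging
from typing import List, Protocol

logger = logging.getLogger(__name__)


class PipelineLike(Protocol):
    """Hypothetical interface: any object with the three stage methods."""

    def extract(self) -> object: ...
    def section(self) -> object: ...
    def annotate(self) -> object: ...


def run_pipeline(corpus: PipelineLike) -> List[str]:
    """Run the stages in order and return the names of those that completed.

    Stops at the first stage that raises, because each later stage
    works on the output of the one before it.
    """
    completed: List[str] = []
    for stage in ("extract", "section", "annotate"):
        logger.info("Running stage: %s", stage)
        try:
            getattr(corpus, stage)()
        except Exception:
            # Log the traceback and skip the remaining stages.
            logger.exception("Stage %s failed; skipping later stages", stage)
            break
        completed.append(stage)
    return completed
```

With this helper, `run_pipeline(corpus)` would replace the three separate calls, and the returned list shows how far processing got.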

License

This project is licensed under the European Union Public Licence 1.2 (EUPL 1.2).
