Welcome to DadmaTools’s documentation!

DadmaTools is a repository of Natural Language Processing (NLP) resources for the Persian language. It is a collection of available datasets, embeddings, and models for a variety of NLP tasks. The aim is to make Persian NLP easier and more accessible to practitioners in industry, so the project is licensed to allow commercial use. The project features code examples showing how to use the datasets, embeddings, and models in popular NLP frameworks such as spaCy, Transformers, and Flair, as well as deep learning frameworks such as PyTorch.

Installation

To get started using DadmaTools in your Python project, simply install the pip package. Installing it will also install the required NLP libraries, such as spaCy, torch, fasttext, and gensim.

Install with pip

To get started using DadmaTools, simply install the project with pip:

pip install DadmaTools 

You can check the requirements.txt file to see which package versions have been tested.

Install from GitHub

Alternatively, you can install the latest version from GitHub using:

pip install git+https://github.com/Dadmatech/dadmatools.git

Quick start

Once you have installed the DadmaTools package, you can use it in your Python project with import dadmatools.

You will find the main functions through the models and datasets modules – see the library documentation for more details about how to use the different functions for loading models and datasets. For analysing texts in Persian, you will primarily need to import functions from dadmatools.pipeline.language in order to load and use our pipeline.
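
As a minimal sketch of what this looks like in practice (the Pipeline class and the doc._.ners attribute are demonstrated in full later in this guide):

import dadmatools.pipeline.language as language

# load a pipeline with only the named-entity recognizer;
# the tokenizer is always loaded by default
nlp = language.Pipeline('ner')

doc = nlp('دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده.')  # doc is a spaCy Doc
print(doc._.ners)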

The DadmaTools package provides several models for different NLP tasks using different frameworks. In this section, you will get a quick tour of the main functions of the DadmaTools package. For a more detailed description of the tasks and frameworks, follow the links to the documentation:

All-in-one with the spaCy models

With DadmaTools you can try out different NLP tasks alongside the pipelines already provided in spaCy. The main advantage of the spaCy model is that it is fast and offers many easy-to-use functions for common NLP tasks.

Pre-processing tasks

Perform part-of-speech tagging, named entity recognition, and dependency parsing at the same time with the DadmaTools spaCy model. Here is a snippet to get started quickly:

For text normalization, you can use dadmatools.models.normalizer.

from dadmatools.models.normalizer import Normalizer

normalizer = Normalizer(
    full_cleaning=False,              # if True, apply every cleaning step aggressively
    unify_chars=True,                 # unify Persian/Arabic character and digit variants
    refine_punc_spacing=True,         # fix spacing around punctuation
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
    replace_email_with="<EMAIL>",     # substitute email addresses with this token
    replace_number_with=None,         # None keeps numbers unchanged
    replace_url_with="",              # an empty string removes URLs entirely
    replace_mobile_number_with=None,
    replace_emoji_with=None,
    replace_home_number_with=None
)

text = """
<p>
دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده. 
امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه
لطفا با ایمیل dadmatools@dadmatech.ir با ما در ارتباط باشید
آدرس گیت‌هاب هم که خب معرف حضور مبارک هست:
 https://github.com/Dadmatech/DadmaTools
</p>
"""
normalized_text = normalizer.normalize(text)
#<p> دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرین‌تر و راحت‌تر کنه لطفا با ایمیل <EMAIL> با ما در ارتباط باشید آدرس گیت‌هاب هم که خب معرف حضور مبارک هست: </p>

# full cleaning: aggressively removes stop words, punctuation, numbers, URLs, and emails
normalizer = Normalizer(full_cleaning=True)
normalized_text = normalizer.normalize(text)
#دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرین‌تر راحت‌تر کنه ایمیل ارتباط آدرس گیت‌هاب معرف حضور مبارک
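
Individual replacement options can also be enabled on their own. For example, the replace_number_with parameter shown above can mask numbers with a placeholder (a usage sketch; the <NUM> token is an arbitrary choice, not a library default):

# mask numbers with a placeholder while leaving the rest of the text intact
normalizer = Normalizer(replace_number_with="<NUM>")
normalized_text = normalizer.normalize("دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده.")
# the year ۱۴۰۰ should now appear as <NUM>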

Sequence labelling with BERT

For part-of-speech tagging, dependency parsing, constituency parsing, and named entity recognition, BERT-based models are provided.

import dadmatools.pipeline.language as language

# here the POS tagger, dependency parser, constituency parser, and NER model will be loaded
# the tokenizer is the default tool, so it is loaded as well without being listed
pips = 'pos,dep,cons,ner' 
nlp = language.Pipeline(pips)

# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))

# doc is a spaCy Doc object
doc = nlp('از قصهٔ کودکیشان که می‌گفت، گاهی حرص می‌خورد!')

dictionary = language.to_json(pips, doc)
print(dictionary)  # shows POS tags, dependency parses, and constituency parses
print(doc._.ners)
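
Since doc is a standard spaCy Doc, you can also read the annotations token by token. A short sketch, assuming the pos and dep pipes above populate spaCy's usual token attributes:

# print per-token annotations produced by the pipeline
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)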

NLP Tools Tasks and Features

Datasets

You can see the details in Different Datasets.

Word Embeddings

You can see the details in Word Embeddings.

Normalizer

You can see the details in Normalizer.

Lemmatizer

You can see the details in Lemmatizer.

POS Tagger

You can see the details in POS Tagger.

Dependency Parser

You can see the details in Dependency Parser.

Constituency Parser

You can see the details in Constituency Parser.

Chunker

You can see the details in Chunker.

Contributing

If you want to contribute to the DadmaTools project, your help is very welcome. You can contribute to the project in many ways:

  • Help us write good tutorials on Persian NLP use-cases.

  • Contribute with your own pretrained NLP models, embeddings, or datasets in Persian. You can also add your own pipeline component using add_pipe once you have created a dadmatools.pipeline.language.Pipeline; a concrete example follows this list.

import dadmatools.pipeline.language as language
pips = '<choose whatever you want for the pipeline>' 
nlp = language.Pipeline(pips)
nlp.add_pipe('<your own OR spaCy default pipeline>')
  • Create GitHub issues with questions and bug reports.

  • Notify us of other Persian NLP resources or tell us about any good ideas that you have for improving the project through the Discussions section of the GitHub repository.
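
As a concrete illustration of the add_pipe workflow above, the sketch below appends spaCy's built-in sentencizer to a DadmaTools pipeline. This assumes the object returned by language.Pipeline behaves like a standard spaCy Language, as the earlier examples suggest:

import dadmatools.pipeline.language as language

nlp = language.Pipeline('pos')

# append spaCy's rule-based sentence segmenter as an extra component
nlp.add_pipe('sentencizer')

# the new component should now show up in the pipe analysis
print(nlp.analyze_pipes(pretty=True))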