Welcome to DadmaTools’s documentation!¶
DadmaTools is a repository of Natural Language Processing (NLP) resources for the Persian language: a collection of available datasets, embeddings, and models for a variety of NLP tasks. The aim is to make Persian NLP easier and more practical for practitioners in industry, which is why the project is licensed to allow commercial use. The project features code examples showing how to use the datasets, embeddings, and models in popular NLP frameworks such as spaCy, Transformers, and Flair, as well as deep learning frameworks such as PyTorch.
Installation¶
To get started using DadmaTools in your Python project, simply install the pip package. Installing the pip package will also install the required NLP libraries, such as spaCy, torch, fasttext, and gensim.
Install with pip¶
To get started using DadmaTools simply install the project with pip:
pip install DadmaTools
You can check the requirements.txt file to see which versions of the packages DadmaTools has been tested with.
Install from GitHub¶
Alternatively you can install the latest version from GitHub using:
pip install git+https://github.com/Dadmatech/dadmatools.git
Quick start¶
Once you have installed the DadmaTools package, you can use it in your Python project with import dadmatools.
You will find the main functions through the models and datasets modules – see the library documentation for more details about how to use the different functions for loading models and datasets.
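For example, you can list and load the bundled datasets. The following is a minimal sketch; get_all_datasets_info and the FarsTail loader are the names advertised in the project README, so verify them against the datasets documentation for your installed version:
from dadmatools.datasets import get_all_datasets_info, FarsTail
# list the names of all bundled datasets
print(get_all_datasets_info().keys())
# download (on first use) and open a single dataset
farstail = FarsTail()
print(farstail.info)  # dataset description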
For analysing texts in Persian, you will primarily need to import functions from dadmatools.pipeline.language in order to load and use our pipeline.
The DadmaTools package provides several models for different NLP tasks using different frameworks. In this section, you will get a quick tour of the main functions of the DadmaTools package. For a more detailed description of the tasks and frameworks, follow the links to the documentation:
Embedding of text with Gensim and Fasttext
Lemmatizing with LSTM (see the pipeline sketch after this list)
Part of speech tagging (POS) with BERT
Named Entity Recognition (NER) with BERT
Dependency parsing and NP-chunking with BERT
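As a first taste of the pipeline, the sketch below loads the lemmatizer on its own. Note that 'lem' is the pipe identifier used in the project README; check the Lemmatizer documentation for the exact name in your version:
import dadmatools.pipeline.language as language
# load only the lemmatizer; the tokenizer is always loaded by default
pips = 'lem'
nlp = language.Pipeline(pips)
doc = nlp('او کتاب‌ها را خواند')  # "He read the books."
print(language.to_json(pips, doc))  # token-level output, including lemmas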
All-in-one with the spaCy models¶
With DadmaTools you can try out different NLP tasks alongside the other pipelines that spaCy already provides. The main advantages of the spaCy model are that it is fast and that it includes many easy-to-use functions for common NLP tasks.
Pre-processing tasks¶
Part-of-speech tagging, named entity recognition, and dependency parsing can all be performed at the same time with the DadmaTools spaCy model. The snippets below will get you started quickly.
For text normalization you can use dadmatools.models.normalizer:
from dadmatools.models.normalizer import Normalizer
normalizer = Normalizer(
full_cleaning=False,
unify_chars=True,
refine_punc_spacing=True,
remove_extra_space=True,
remove_puncs=False,
remove_html=False,
remove_stop_word=False,
replace_email_with="<EMAIL>",
replace_number_with=None,
replace_url_with="",
replace_mobile_number_with=None,
replace_emoji_with=None,
replace_home_number_with=None
)
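# sample Persian text containing HTML tags, an email address, and a URL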
text = """
<p>
دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده.
امیدواریم که این تولز بتونه کار با متن رو براتون شیرینتر و راحتتر کنه
لطفا با ایمیل dadmatools@dadmatech.ir با ما در ارتباط باشید
آدرس گیتهاب هم که خب معرف حضور مبارک هست:
https://github.com/Dadmatech/DadmaTools
</p>
"""
normalized_text = normalizer.normalize(text)
#<p> دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرینتر و راحتتر کنه لطفا با ایمیل <EMAIL> با ما در ارتباط باشید آدرس گیتهاب هم که خب معرف حضور مبارک هست: </p>
# full cleaning
normalizer = Normalizer(full_cleaning=True)
normalized_text = normalizer.normalize(text)
#دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرینتر راحتتر کنه ایمیل ارتباط آدرس گیتهاب معرف حضور مبارک
Sequence labelling with BERT¶
For part-of-speech tagging, dependency parsing, constituency parsing, and named entity recognition, BERT models are provided.
import dadmatools.pipeline.language as language
# the POS tagger, dependency parser, constituency parser, and NER model will be loaded;
# the tokenizer is the default tool, so it is loaded as well even without being requested
pips = 'pos,dep,cons,ner'
nlp = language.Pipeline(pips)
# you can inspect the loaded pipeline with this call
print(nlp.analyze_pipes(pretty=True))
# doc is a spaCy Doc object
doc = nlp('از قصهٔ کودکیشان که میگفت، گاهی حرص میخورد!')  # "When he told of their childhood story, he sometimes got annoyed!"
dictionary = language.to_json(pips, doc)
print(dictionary)  # shows the POS tags, dependency parses, and constituency parses
print(doc._.ners)
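If you only need a single task, pass just that pipe name. The snippet below loads NER on its own, using the same doc._.ners extension shown above (the example sentence is our own, not from the library docs):
# NER only; the tokenizer is still loaded by default
nlp_ner = language.Pipeline('ner')
doc = nlp_ner('دادماتک در تهران است')  # "Dadmatech is in Tehran."
print(doc._.ners)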
NLP Tools: Tasks and Features¶
Datasets¶
You can see the details in Different Datasets.
Word Embeddings¶
You can see the details in Word Embeddings.
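As a quick example, an embedding can be loaded by name. This is a minimal sketch; get_embedding and the 'glove-wiki' model name follow the project README and may differ between versions:
from dadmatools.embeddings import get_embedding
# download (on first use) and load a pretrained embedding by name
word_embedding = get_embedding('glove-wiki')
print(word_embedding['سلام'])  # the vector for a single word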
Normalizer¶
You can see the details in Normalizer.
Lemmatizer¶
You can see the details in Lemmatizer.
POS Tagger¶
You can see the details in POS Tagger.
Dependency Parser¶
You can see the details in Dependency Parser.
Constituency Parser¶
You can see the details in Constituency Parser.
Chunker¶
You can see the details in Chunker.
Contributing¶
If you want to contribute to the DadmaTools project, your help is very welcome. You can contribute to the project in many ways:
Help us write good tutorials on Persian NLP use-cases.
Contribute with your own pretrained NLP models, embeddings, or datasets in Persian. You can also add your own pipeline component using add_pipe once you have created a dadmatools.pipeline.language.Pipeline:
import dadmatools.pipeline.language as language
pips = '<choose whatever you want for the pipeline>'
nlp = language.Pipeline(pips)
nlp.add_pipe('<your own OR a spaCy built-in component>')
Create GitHub issues with questions and bug reports.
Notify us of other Persian NLP resources or tell us about any good ideas that you have for improving the project through the Discussions section of the GitHub repository.