Quick start¶
Once you have installed the DadmaTools package, you can use it in your Python project with import dadmatools.
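Before importing, it can help to confirm that the package is visible to the interpreter you are running. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of DadmaTools.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Prints True only when dadmatools is installed in the active environment.
print(is_installed("dadmatools"))
```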
You will find the main functions in the models and datasets modules; see the library documentation for details on the functions for loading models and datasets.
For analysing texts in Persian, you will primarily need to import functions from dadmatools.pipeline.language in order to load and use our pipeline.
The DadmaTools package provides several models for different NLP tasks, built on different frameworks. This section gives a quick tour of its main functions. For a more detailed description of the tasks and frameworks, follow the links to the documentation:
Embedding of text with Gensim and Fasttext
Lemmatizing with LSTM
Part of speech tagging (POS) with BERT
Named Entity Recognition (NER) with BERT
Dependency parsing and NP-chunking with BERT
All-in-one with the spaCy models¶
With DadmaTools you can try out different NLP tasks alongside other pipelines that are already available in spaCy. The main advantages of the spaCy model are that it is fast and that it bundles many NLP functions that are easy to use.
Pre-processing tasks¶
Perform part-of-speech tagging, named entity recognition and dependency parsing at the same time with the DadmaTools spaCy model.
For text normalization, you can use the Normalizer class from dadmatools.models.normalizer. Here is a snippet to get started quickly:
from dadmatools.models.normalizer import Normalizer
normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
    replace_email_with="<EMAIL>",
    replace_number_with=None,
    replace_url_with="",
    replace_mobile_number_with=None,
    replace_emoji_with=None,
    replace_home_number_with=None
)
text = """
<p>
دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده.
امیدواریم که این تولز بتونه کار با متن رو براتون شیرینتر و راحتتر کنه
لطفا با ایمیل dadmatools@dadmatech.ir با ما در ارتباط باشید
آدرس گیتهاب هم که خب معرف حضور مبارک هست:
https://github.com/Dadmatech/DadmaTools
</p>
"""
normalized_text = normalizer.normalize(text)
#<p> دادماتولز اولین نسخش سال 1400 منتشر شده. امیدواریم که این تولز بتونه کار با متن رو براتون شیرینتر و راحتتر کنه لطفا با ایمیل <EMAIL> با ما در ارتباط باشید آدرس گیتهاب هم که خب معرف حضور مبارک هست: </p>
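# full cleaning: judging by the output below, this also strips the HTML tags, punctuation, numbers, stop words, the email address and the URL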
normalizer = Normalizer(full_cleaning=True)
normalized_text = normalizer.normalize(text)
#دادماتولز نسخش سال منتشر تولز بتونه کار متن براتون شیرینتر راحتتر کنه ایمیل ارتباط آدرس گیتهاب معرف حضور مبارک
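To make a few of these options more concrete, here is a rough sketch in plain Python of the kind of substitutions they describe: converting Persian digits to ASCII, replacing email addresses and URLs, and collapsing extra whitespace. It is not the DadmaTools implementation. The function sketch_normalize, the regular expressions, and the guess that digit conversion belongs to character unification are all assumptions made for illustration.

```python
import re

# Map Persian and Arabic-Indic digits to ASCII digits.
DIGIT_TABLE = str.maketrans("۰۱۲۳۴۵۶۷۸۹٠١٢٣٤٥٦٧٨٩", "01234567890123456789")

# Deliberately simple patterns; real-world emails and URLs are messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")


def sketch_normalize(text, replace_email_with="<EMAIL>", replace_url_with=""):
    """Toy illustration of a few normalizer options (hypothetical helper)."""
    text = text.translate(DIGIT_TABLE)             # digit unification (assumed)
    text = EMAIL_RE.sub(replace_email_with, text)  # cf. replace_email_with
    text = URL_RE.sub(replace_url_with, text)      # cf. replace_url_with
    text = re.sub(r"\s+", " ", text).strip()       # cf. remove_extra_space
    return text


sample = "سال ۱۴۰۰   ایمیل dadmatools@dadmatech.ir\nhttps://github.com/Dadmatech/DadmaTools"
print(sketch_normalize(sample))
```

For real text, prefer the library's Normalizer, which handles many more cases than these three patterns.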
Sequence labelling with BERT¶
For part-of-speech tagging, dependency parsing, constituency parsing and named entity recognition, BERT models are provided.
import dadmatools.pipeline.language as language
# here the POS tagger, dependency parser, constituency parser and NER model will be loaded
# the tokenizer is the default tool, so it is loaded as well even if it is not requested
pips = 'pos,dep,cons,ner'
nlp = language.Pipeline(pips)
# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))
# doc is a spaCy Doc object
doc = nlp('از قصهٔ کودکیشان که میگفت، گاهی حرص میخورد!')
dictionary = language.to_json(pips, doc)
print(dictionary)  # shows POS tags, dependency parses, and constituency parses
print(doc._.ners)
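Because the pipeline is configured with a plain comma-separated string, a typo in a task name may only surface once model loading starts. The sketch below checks the names first. The helpers build_pipeline_spec and load_pipeline are hypothetical, and the allowed list contains only the task names that appear on this page; the library may accept others.

```python
# Task names taken from this page only; the library may accept more.
PAGE_TASKS = ("pos", "dep", "cons", "ner")


def build_pipeline_spec(tasks, allowed=PAGE_TASKS):
    """Return a comma-separated spec such as 'pos,ner' (hypothetical helper)."""
    spec = []
    for task in tasks:
        name = task.strip().lower()
        if name not in allowed:
            raise ValueError(f"unknown task: {task!r}")
        if name not in spec:  # drop duplicates, keep first-seen order
            spec.append(name)
    return ",".join(spec)


def load_pipeline(tasks):
    """Validate the task names, then load the models (needs dadmatools)."""
    import dadmatools.pipeline.language as language  # imported lazily
    return language.Pipeline(build_pipeline_spec(tasks))


print(build_pipeline_spec(["pos", "dep", "cons", "ner"]))
```

Calling load_pipeline(["pos", "ner"]) would then pass 'pos,ner' to language.Pipeline, in the same way as the snippet above.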