Datasets

This section lists the publicly available Persian NLP datasets.

Dataset             Task
PersianNER          Named Entity Recognition
ARMAN               Named Entity Recognition
Peyma               Named Entity Recognition
FarsTail            Textual Entailment
FaSpell             Spell Checking
PersianNews         Text Classification
PerUDT              Universal Dependency
PnSummary           Text Summarization
SnappfoodSentiment  Sentiment Classification
TEP                 Text Translation (eng-fa)
WikipediaCorpus     Corpus
PersianTweets       Corpus

We will add descriptions of all datasets in the future.

from dadmatools.datasets import FarsTail
from dadmatools.datasets import SnappfoodSentiment
from dadmatools.datasets import Peyma
from dadmatools.datasets import PerUDT
from dadmatools.datasets import PersianTweets
from dadmatools.datasets import PnSummary


farstail = FarsTail()
# number of items in the training split
print(len(farstail.train))

# each split behaves like a generator: next() returns the next item
print(next(farstail.train))

# dataset details
pn_summary = PnSummary()
print('PnSummary dataset information: ', pn_summary.info)

# loop over the test split and print each comment with its label
snpfood_sa = SnappfoodSentiment()
for item in snpfood_sa.test:
    print(item['comment'], item['label'])

# print the lemma of the first token of every dev item
perudt = PerUDT()
for token_list in perudt.dev:
    print(token_list[0]['lemma'])

# get the NER tag of the first token in the first Peyma item
peyma = Peyma()
print(next(peyma.data)[0]['tag'])

# raw-text corpus: number of tweets and one sample
tweets = PersianTweets()
print('tweets count : ', len(tweets.data))
print('sample tweet: ', next(tweets.data))
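Because each item is accessed like a dictionary (for example item['label'] in the SnappfoodSentiment loop above), ordinary Python tools can summarize a split. The following is a minimal sketch, not part of the library: count_labels is a hypothetical helper, and sample_split is a made-up stand-in for a real split so that the snippet runs without downloading any data. The label values shown are placeholders and may not match the real dataset.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_labels(items: Iterable[Mapping[str, str]], key: str = "label") -> Counter:
    """Count how often each label value occurs in an iterable of dict-like items."""
    return Counter(item[key] for item in items)


# Hypothetical stand-in for a split such as snpfood_sa.test; the real split
# is loaded by the library and its label values may differ.
sample_split = [
    {"comment": "great food", "label": "HAPPY"},
    {"comment": "arrived cold", "label": "SAD"},
    {"comment": "fast delivery", "label": "HAPPY"},
]

label_counts = count_labels(sample_split)
print(label_counts.most_common())
```

With the library installed, the same helper could presumably be called as count_labels(snpfood_sa.test), assuming every item exposes a 'label' key. If the split is consumed like a generator, it may need to be reloaded before being iterated again.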