Pretrained Persian embeddings

This repository keeps a list of publicly available pretrained word embeddings for Persian. The dadmatools.embeddings module provides functions for loading these embeddings and for performing common operations on them.

Name	Embedding Algorithm	Corpus
glove-wiki	glove	Wikipedia
fasttext-commoncrawl-bin	fasttext	CommonCrawl
fasttext-commoncrawl-vec	fasttext	CommonCrawl
word2vec-conll	word2vec	Persian CoNLL17 corpus

Embeddings represent text as numeric vectors and can be computed for characters, subword units, words, sentences, or documents. These Persian word embedding models can be used easily with DadmaTools.

from dadmatools.embeddings import get_embedding, get_all_embeddings_info, get_embedding_info
from pprint import pprint

# Print information about all available embeddings
pprint(get_all_embeddings_info())

# Get information about a specific embedding
embedding_info = get_embedding_info('glove-wiki')

# Load the embedding
word_embedding = get_embedding('glove-wiki')

# Get the vector of a word ('سلام' means "hello")
print(word_embedding['سلام'])

# Get the vocabulary
vocab = word_embedding.get_vocab()

# Some useful functions
# The 10 nearest words to "زمستان" ("winter")
print(word_embedding.top_nearest("زمستان", 10))
# Similarity between 'کتب' ("books") and 'کتاب' ("book")
print(word_embedding.similarity('کتب', 'کتاب'))
# Embedding of a whole text ("The weather was nice today")
print(word_embedding.embedding_text('امروز هوای خوبی بود'))
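
To give a rough intuition for what operations like similarity, nearest-word lookup, and text embedding usually compute, here is a minimal sketch in plain Python. It assumes cosine similarity for comparing vectors and a simple average of word vectors for a text; how DadmaTools actually implements these functions is not stated here and may differ. The toy vectors and the function names cosine_similarity, nearest_words, and mean_text_vector are hypothetical and are not part of the DadmaTools API.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pretrained embeddings have far more dimensions.
toy_vectors = {
    "winter": [0.9, 0.1, 0.0],
    "snow": [0.8, 0.2, 0.1],
    "book": [0.0, 0.9, 0.3],
    "books": [0.1, 0.8, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(word, k, vectors=toy_vectors):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def mean_text_vector(text, vectors=toy_vectors):
    """Average the vectors of known whitespace-separated tokens.

    Returns None when no token is in the vocabulary.
    """
    known = [vectors[tok] for tok in text.split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


print(nearest_words("winter", 2))
print(cosine_similarity(toy_vectors["book"], toy_vectors["books"]))
print(mean_text_vector("winter snow"))
```

Averaging word vectors is only one simple way to embed a sentence. It ignores word order, and in this sketch out-of-vocabulary tokens are silently dropped.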