Pretrained Persian embeddings
This repository keeps a list of publicly available pretrained word embeddings for Persian. The dadmatools.embeddings
module provides functions for loading these embeddings and for common operations on them.
Name | Embedding Algorithm | Corpus |
---|---|---|
glove-wiki | glove | Wikipedia |
fasttext-commoncrawl-bin | fasttext | CommonCrawl |
fasttext-commoncrawl-vec | fasttext | CommonCrawl |
word2vec-conll | word2vec | Persian CoNLL17 corpus |
Embeddings represent text as numeric vectors and can be computed for characters, subword units, words, sentences, or documents. These Persian word embedding models can be loaded and used through DadmaTools:
from dadmatools.embeddings import get_embedding, get_all_embeddings_info, get_embedding_info
from pprint import pprint
pprint(get_all_embeddings_info())
# get information about a specific embedding
embedding_info = get_embedding_info('glove-wiki')
#### load embedding ####
word_embedding = get_embedding('glove-wiki')
# get the vector of a word
print(word_embedding['سلام'])
# get the vocabulary of the embedding
vocab = word_embedding.get_vocab()
### some useful functions ###
print(word_embedding.top_nearest("زمستان", 10))
print(word_embedding.similarity('کتب', 'کتاب'))
print(word_embedding.embedding_text('امروز هوای خوبی بود'))
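To give a feel for what similarity, nearest-neighbour, and text-embedding helpers of this kind typically compute, here is a minimal, self-contained sketch in plain Python. It does not use DadmaTools and does not claim to reproduce its internals. The toy vectors, the function names (`cosine_similarity`, `nearest_words`, `average_vector`), and the choice of cosine similarity and mean pooling are all assumptions made for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for real pretrained embeddings.
# The words and values are hypothetical, chosen only for illustration.
toy_vectors = {
    "winter": [0.9, 0.1, 0.0],
    "snow": [0.8, 0.2, 0.1],
    "book": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, k):
    """Return the k words most similar to `word`, best match first."""
    query = toy_vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in toy_vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def average_vector(tokens):
    """Mean-pool the vectors of known tokens into one vector for the text."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        # No known tokens: fall back to a zero vector of the same dimension.
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


print(nearest_words("winter", 2))
print(average_vector(["winter", "snow", "unknown-token"]))
```

With real embeddings the same idea applies to vectors of a few hundred dimensions. Whether a given library uses cosine similarity or mean pooling for these operations should be checked against its own documentation.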