Abbas Safardoost

Psycholinguist, Computational Linguist, Translator.

Work Experience


Freelance Translator & Researcher
2015 - Current

Lexicographer
2013 - 2015

Copywriter
2010 - 2013

Education


Tarbiat Modares University
Sep 2019 - May 2025
Ph.D. in Linguistics
Thesis Title:

Facing Persian in the Wild:
Corpus Compilation, Annotation, Tag Encoding, and Processing of Digital Writing Register

The proliferation of digital communication technologies has profoundly reshaped linguistic landscapes, giving rise to a distinct digital writing register that now constitutes a substantial component of daily interaction. Unlike traditional written registers governed by established prescriptive norms, the emergent register is characterized by a lack of central regulatory authority, positioning it as a 'wild' register within linguistic inquiry. This characteristic necessitates a critical re-evaluation of conventional methodologies for linguistic analysis and processing. This study proposes and applies a novel methodological framework designed specifically for engaging with this complex linguistic register. We revisit the processes of corpus compilation, preprocessing, encoding, and annotation, aiming to streamline fundamental processing tasks while rigorously preserving the inherent linguistic and orthographic diversity characteristic of digital writing. The proposed approach deliberately maintains variations during initial preprocessing stages, deferring the crucial task of word identity disambiguation to subsequent, more context-aware processing steps. Our methodology commenced with the extraction of a large-scale corpus, designated '1400', comprising one million random tweets sourced from the Persian calendar year 1400. A one-million-token sub-corpus was subsequently isolated to serve as the focus for detailed preprocessing, part-of-speech tagging, and lemmatization. During the normalization and tokenization phases, we implemented two primary strategies to enhance processing simplicity: the systematic removal of elements that introduce conventional orthographic variation (such as the zero-width non-joiner) and the strategic segmentation of multi-component elements possessing independent grammatical identities. For corpus annotation, we introduce 'Tagframe', a novel and flexible framework for tag encoding. 
The framework is engineered to facilitate the granular representation of linguistic, paralinguistic, and intra-lexical information (such as clitic structures). Leveraging Tagframe, we developed a standardized tagframe specifically tailored for Persian digital text. This involved a necessary revision of certain traditional grammatical concepts and the inclusion of new categories and layered features representing linguistic, internal structure, and paralinguistic characteristics. Lemmatization was performed utilizing diacritization to resolve homographic ambiguity. The final corpus annotation employed 25 linguistic categories supplemented by 82 linguistic features across three distinct layers: linguistic, intra-lexical, and paralinguistic. Statistical analysis of the annotated corpus, including comparison with other corpora, revealed a notably high frequency of hapax legomena, indicative of the corpus's rich diversity, and a high frequency of verbs, suggesting shared contextual characteristics with spoken language registers. The comprehensive methodology and the resulting annotated corpus presented herein establish a foundational resource and a replicable workflow that can significantly facilitate the creation and processing of similar corpora, thereby enabling further linguistic and computational investigations into this and related language registers.
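
Two ideas from the abstract above, removing the zero-width non-joiner during normalization and measuring the share of hapax legomena, can be illustrated with a minimal Python sketch. This is not the thesis pipeline: the function names, the whitespace tokenizer, and the sample sentence are assumptions made purely for illustration, and the sketch assumes that "removal" means plain deletion of the character.

```python
from collections import Counter

ZWNJ = "\u200c"  # zero-width non-joiner (U+200C)


def remove_zwnj(text: str) -> str:
    """Delete every zero-width non-joiner (assumed: plain deletion, no space inserted)."""
    return text.replace(ZWNJ, "")


def hapax_ratio(tokens: list[str]) -> float:
    """Return the share of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)


# Hypothetical sample; whitespace splitting stands in for a real tokenizer.
sample = "می\u200cروم و می\u200cروم خانه"
tokens = remove_zwnj(sample).split()
print(len(tokens), hapax_ratio(tokens))
```

In the sample, the two spellings of the verb collapse into one type after the non-joiner is deleted, so two of the three remaining types are hapax legomena.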


Tarbiat Modares University
Sep 2013 - 2016
Master's Degree in Linguistics
Thesis Title:

Inflection of Tense in Persian Verbs and How It Is Processed: A Psycholinguistic Perspective

In this research, we studied Persian inflectional tense forms and how they are processed. Since historical studies are dominant in Persian language research, we dedicated the first two questions of this research to revising some of the concepts in the area. In the first question, we asked how Persian verbs can be categorized based on the concept of regularity. The results showed that Persian verbs fall into three categories: regular, irregular, and alternative. Regulars are an open category whose past tense forms are produced by adding the regular suffix (-id) to the root and which do not belong to the other categories. Irregulars are a closed category in which there is no systematic relation between roots and past tense forms; they are represented as a list of connected pairs. Alternatives, which are a kind of regular verb, are an open category that does not belong to the irregulars; their roots consist of more than two syllables and end with -ɑn, and they use the -d and -id suffixes to produce their past tense forms. In the second question, we asked which model represents the process of tense inflection. The results showed that the present tense of regulars and irregulars, as well as the past tense of most regulars, is produced by retrieving the root and adding the proper affixes. On the other hand, the past tense of irregulars and of some high-frequency regulars is produced by retrieving the past stem from memory. In the third question, we asked about the mental mechanisms involved in producing Persian verbs. Steven Pinker claims that the distinction between regulars and irregulars is rooted in two distinct mechanisms: a rule mechanism involved in producing regulars and a rote mechanism involved in producing irregulars. The results of this research confirm Pinker's claim.
To examine these three questions, we used a wide range of evidence, but the main source was an experiment in which we presented participants with the present tense of verbs and asked them to produce the past tense of these verbs as fast as possible. We then collected the errors they made and used them as evidence.


Shiraz University
Sep 2009 - 2013
Bachelor's Degree in Linguistics

Publications


Journal Articles


The Category and Structure of Persian Numerals

Iranian Journal of Comparative Linguistic Research

2024

10.22084/rjhll.2024.28776.2303


Encyclopedia Entries


Encyclopedia of Persian Language and Literature (vol. VI)

Vafa Zavvarei

Academy of Persian Literature and Language Press

2018


Dictionaries


The Comprehensive Dictionary of Persian Language (vol. II)

Academy of Persian Literature and Language Press

2017

Projects


Title: PTW1400

Goals:
PTW1400 (Persian TWeets of 1400) is an ongoing project in which we are building a corpus of one million Persian tweets from the year 1400 (Shamsi Hijri calendar). A part of the corpus (50,000 tweets; about one million tokens) has been manually normalized, tagged, and lemmatized using a new method of tag encoding named Tagframe.


Title: A Psycholinguistic Study of Verb Inflection

Goals:
In this project, we study how regular and irregular verbs are produced and perceived, which mechanisms (i.e., rule and rote) are involved in the process of verb inflection, and how verbs are organized in the brain. Our primary focus is on past tense inflection.

Related Published Articles:

Words and Rules theory vs. Generative Phonology: A Study of the Past Tense Inflection of Persian Verbs


Title: CPVI (Comprehensive Persian Verb Inflector)

Goals:
CPVI (Comprehensive Persian Verb Inflector) inflects Persian verbs using the Dual Mechanism theory (Words & Rules theory).
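
As a rough illustration of the dual-mechanism idea (not CPVI's actual code), the Python sketch below looks up a stored past stem first (rote) and otherwise appends the regular suffix -id to the root (rule). The tiny transliterated lexicon and all names are hypothetical and chosen only for the example.

```python
# Hypothetical mini-lexicon of irregular verbs: present root -> stored past stem
# (transliterated; a real inflector would need a full list).
IRREGULAR_PAST = {
    "rav": "raft",  # go
    "kon": "kard",  # do
    "bin": "did",   # see
}

REGULAR_SUFFIX = "id"  # regular past suffix


def past_stem(root: str) -> str:
    """Return the past stem: rote lookup first, regular rule as the fallback."""
    stored = IRREGULAR_PAST.get(root)
    if stored is not None:
        return stored  # rote mechanism: retrieve the stored form
    return root + REGULAR_SUFFIX  # rule mechanism: root + -id


for root in ("rav", "xar", "fahm"):
    print(root, "->", past_stem(root))
```

Checking the stored list before applying the rule mirrors the blocking behavior that Words & Rules theory assumes: a memorized irregular form pre-empts the regular suffix.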


Title: Phonotactics, Phonostatistics & Syllabification of Persian Words

Goals:
In this project, we study the phonotactics, phonostatistics, and syllabification of almost 55,000 Persian words.

Related Published Articles:

Distribution of Nasals in Consonant Clusters; and the Distributional-phonological Characteristics of Loanwords in Persian


Title: Enhanced Flexicon

Goals:
Enhanced Flexicon is an edited and enhanced version of the Flexicon corpus. The work done in this project includes IPA transcription, syllabification, the addition of initial glottal stops, and minor edits.


Certificates


ML & NLP


Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning

DeepLearning.AI

K6Y4HLM8EWTS


Natural Language Processing in TensorFlow

DeepLearning.AI

L95YZVA24QBJ


Natural Language Processing with Classification and Vector Spaces

DeepLearning.AI

39PT7EBFHUPK


Fine Tune BERT for Text Classification with TensorFlow

Coursera

9AWKHCK5HBV3


Introduction to Natural Language Processing in Python

Coursera

84KDYLAJ9C29


Transfer Learning for NLP with TensorFlow Hub

Coursera

2HCG8GAM48SK


Named Entity Recognition using LSTMs with Keras

Coursera

F5NBURAA7BFC


Fake News Detection with Machine Learning

Coursera

E985AAGH9FWZ


Convolutions for Text Classification with Keras

Coursera

EGSM7HZX9JBM


Tweet Emotion Recognition with TensorFlow

Coursera

S32NY3M76AFS


NLP: Twitter Sentiment Analysis

Coursera

NBTAPAJBH5ZE


Basic Sentiment Analysis with TensorFlow

Coursera

5F3TDXZU69DJ


The Data Scientist’s Toolbox

Johns Hopkins University

T9VVDG3LYA89


Programming


Introduction to Git and GitHub

Google

QXC4S4SUSJVX


Crash Course on Python

Google

TY2SKCFHLWRL


Troubleshooting and Debugging Techniques

Google

SMK6EJC3NKMW


Using Python to Interact with the Operating System

Google

9F2RU7Y8EUWZ


Create Python Linux Script to Generate a Disk Usage Report

Coursera

UN64QLE6XCPP


Automation Scripts Using Bash

Coursera

2PFRXWCYY7YD


Processing Data with Python

Coursera

4QHMXAVLR98M


Introduction to Bash Shell Scripting

Coursera

MGZUSQ524G2K


Practical Introduction to the Command Line

Coursera

N7DN3LC99FC3


Python for Everybody (Specialization)

University of Michigan

G9Y6UZHB7MVK


Python 3 Programming (Specialization)

University of Michigan

7LW7W2L6T5HC


Introduction to HTML5

University of Michigan

XVRPLQCGRHM7


Introduction to CSS3

University of Michigan

FWXCD8UREFVY


Statistics


Basic Statistics

University of Amsterdam

YMC6JZQFBTPG


Quantitative Methods

University of Amsterdam

DFLY7NQ2TLKW


Data Science Math Skills

Duke University

KDFY8FU862S7

Linguistics & Psychology


The Bilingual Brain

University of Houston

GNWHC4YT5U3B


Fundamental Neuroscience for Neuroimaging

Johns Hopkins University

DUKZ5M7CM6G9


Introduction to Psychology

University of Toronto

BYGC3ETHDDW6


Big data and Language 1

Korea Advanced Institute of Science and Technology (KAIST)

HNHFREK7VJ6T


Philosophy and the Sciences: Introduction to the Philosophy of Cognitive Sciences

The University of Edinburgh

WAQRB9MGC9LQ


My Skills


Python

95%

Web Crawling

100%

Git

85%

Tkinter

75%

HTML

80%

Languages


Persian

100%

English

100%