Abbas Safardoost

Work Experience

Freelance Translator & Researcher

2015 - Current

Lexicographer

2013 - 2015

Copy Writer

2010 - 2013

Education

Tarbiat Modares University

Sep 2019 - May 2025

Ph.D. in Linguistics

Thesis Title:

Facing Persian in the Wild:
Corpus Compilation, Annotation, Tag Encoding, and Processing of Digital Writing Register

The proliferation of digital communication technologies has profoundly reshaped linguistic landscapes, giving rise to a distinct digital writing register that now constitutes a substantial component of daily interaction. Unlike traditional written registers governed by established prescriptive norms, the emergent register is characterized by a lack of central regulatory authority, positioning it as a 'wild' register within linguistic inquiry. This characteristic necessitates a critical re-evaluation of conventional methodologies for linguistic analysis and processing. This study proposes and applies a novel methodological framework designed specifically for engaging with this complex linguistic register. We revisit the processes of corpus compilation, preprocessing, encoding, and annotation, aiming to streamline fundamental processing tasks while rigorously preserving the inherent linguistic and orthographic diversity characteristic of digital writing. The proposed approach deliberately maintains variations during initial preprocessing stages, deferring the crucial task of word identity disambiguation to subsequent, more context-aware processing steps. Our methodology commenced with the extraction of a large-scale corpus, designated '1400', comprising one million random tweets sourced from the Persian calendar year 1400. A one-million-token sub-corpus was subsequently isolated to serve as the focus for detailed preprocessing, part-of-speech tagging, and lemmatization. During the normalization and tokenization phases, we implemented two primary strategies to enhance processing simplicity: the systematic removal of elements that introduce conventional orthographic variation (such as the zero-width non-joiner) and the strategic segmentation of multi-component elements possessing independent grammatical identities. For corpus annotation, we introduce 'Tagframe', a novel and flexible framework for tag encoding.
The framework is engineered to facilitate the granular representation of linguistic, paralinguistic, and intra-lexical information (such as clitic structures). Leveraging Tagframe, we developed a standardized tagframe specifically tailored for Persian digital text. This involved a necessary revision of certain traditional grammatical concepts and the inclusion of new categories and layered features representing linguistic, internal structure, and paralinguistic characteristics. Lemmatization was performed utilizing diacritization to resolve homographic ambiguity. The final corpus annotation employed 25 linguistic categories supplemented by 82 linguistic features across three distinct layers: linguistic, intra-lexical, and paralinguistic. Statistical analysis of the annotated corpus, including comparison with other corpora, revealed a notably high frequency of hapax legomena, indicative of the corpus's rich diversity, and a high frequency of verbs, suggesting shared contextual characteristics with spoken language registers. The comprehensive methodology and the resulting annotated corpus presented herein establish a foundational resource and a replicable workflow that can significantly facilitate the creation and processing of similar corpora, thereby enabling further linguistic and computational investigations into this and related language registers.

Tarbiat Modares University

Sep 2013 - 2016

Master's Degree in Linguistics

Thesis Title:

Inflection of Tense in Persian Verbs and How It Is Processed: A Psycholinguistic Perspective

In this research, we studied Persian inflectional tense forms and how they are processed. Since historical studies dominate the study of language in Persian, we dedicated the first two questions of this research to revising some of the concepts in the area. In the first question, we asked how Persian verbs can be categorized based on the concept of regularity. The results showed that there are three categories of Persian verbs: regular, irregular, and alternative. Regulars are an open category whose past tense forms are produced by adding the regular suffix (-id) to the root and which do not belong to the other categories. Irregulars are a closed category in which there is no systematic relation between roots and past tense forms; they are represented as a list of connected pairs. Alternatives, which are a kind of regular verb, are an open category that does not belong to the irregulars; their roots consist of more than two syllables and end with -ɑn, and they use the -d and -id suffixes to produce their past tense forms. In the second question, we asked which model represents the process of tense inflection. The results showed that the present tense of regulars and irregulars, as well as the past tense of most regulars, is produced by retrieving the root and adding the proper affixes. On the other hand, the past tense of irregulars and of some high-frequency regulars is produced by retrieving the past stem from memory. In the third question, we asked about the mental mechanisms involved in producing Persian verbs. Steven Pinker claims that the distinction between regulars and irregulars is rooted in two distinct mechanisms: a rule mechanism involved in producing regulars and a rote mechanism involved in producing irregulars. The results of this research confirm Pinker's claim.
To examine these three questions, we used a wide range of evidence, but the main source of evidence was an experiment in which we presented participants with the present tense of verbs and asked them to produce the past tense of these verbs as fast as possible. We then gathered the errors they made and used them as data for our analysis.

Shiraz University

Sep 2009 - 2013

Bachelor's Degree in Linguistics

Publications

Journal Articles

The Category and Structure of Persian Numerals

Iranian Journal of Comparative Linguistic Research

2024

10.22084/rjhll.2024.28776.2303

Words and Rules theory vs. Generative Phonology: A Study of the Past Tense Inflection of Persian Verbs

Iranian Journal of Comparative Linguistic Research

2022

10.22084/rjhll.2022.26069.2209

Distribution of Nasals in Consonant Clusters; and the Distributional-phonological Characteristics of Loanwords in Persian

Language Science

2020

10.22054/ls.2019.36693.1139

Encyclopedia Entries

Encyclopedia of Persian Language and Literature (vol. VI)

Vafa Zavvarei

Academy of Persian Literature and Language Press

2018

Dictionaries

The Comprehensive Dictionary of Persian Language (vol. II)

Academy of Persian Literature and Language Press

2017

Projects

Title: PTW1400

Goals:

PTW1400 (Persian TWeets of 1400) is an ongoing project in which we are building a corpus containing one million Persian tweets from the year 1400 (Shamsi Hijri calendar). A part of the corpus (50,000 tweets; about one million tokens) has been manually normalized, tagged, and lemmatized using a new method of tag encoding named Tagframe.

Title: A Psycholinguistic Study of Verb Inflection

Goals:

In this project, we study how regular and irregular verbs are produced and perceived, which mechanisms (i.e. rule and rote) are involved in the process of verb inflection, and how verbs are organized in the brain. Our primary focus is on past tense inflection.

Title: CPVI (Comprehensive Persian Verb Inflector)

Goals:

CPVI (Comprehensive Persian Verb Inflector) is a Persian verb inflector that uses the Dual Mechanism theory (Words & Rules theory) to inflect Persian verbs.

Title: Phonotactics, Phonostatistics & Syllabification of Persian Words

Goals:

In this project, we study the phonotactics, phonostatistics, and syllabification of almost 55,000 Persian words.

Title: Enhanced Flexicon

Goals:

Enhanced Flexicon is an edited and enhanced version of the Flexicon corpus. The work done in this project includes IPA transcription, syllabification, the addition of initial glottal stops, and minor edits.

Certificates

ML & NLP

Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning

DeepLearning.AI

K6Y4HLM8EWTS

Natural Language Processing in TensorFlow

DeepLearning.AI

L95YZVA24QBJ

Natural Language Processing with Classification and Vector Spaces

DeepLearning.AI

39PT7EBFHUPK

Fine Tune BERT for Text Classification with TensorFlow

Coursera

9AWKHCK5HBV3

Introduction to Natural Language Processing in Python

Coursera

84KDYLAJ9C29

Transfer Learning for NLP with TensorFlow Hub

Coursera

2HCG8GAM48SK

Named Entity Recognition using LSTMs with Keras

Coursera

F5NBURAA7BFC

Fake News Detection with Machine Learning

Coursera

E985AAGH9FWZ

Convolutions for Text Classification with Keras

Coursera

EGSM7HZX9JBM

Tweet Emotion Recognition with TensorFlow

Coursera

S32NY3M76AFS

NLP: Twitter Sentiment Analysis

Coursera

NBTAPAJBH5ZE

Basic Sentiment Analysis with TensorFlow

Coursera

5F3TDXZU69DJ

The Data Scientist’s Toolbox

Johns Hopkins University

T9VVDG3LYA89

Programming

Introduction to Git and GitHub

Google

QXC4S4SUSJVX

Crash Course on Python

Google

TY2SKCFHLWRL

Troubleshooting and Debugging Techniques

Google

SMK6EJC3NKMW

Using Python to Interact with the Operating System

Google

9F2RU7Y8EUWZ

Create Python Linux Script to Generate a Disk Usage Report

Coursera

UN64QLE6XCPP

Automation Scripts Using Bash

Coursera

2PFRXWCYY7YD

Processing Data with Python

Coursera

4QHMXAVLR98M

Introduction to Bash Shell Scripting

Coursera

MGZUSQ524G2K

Practical Introduction to the Command Line

Coursera

N7DN3LC99FC3

Python for Everybody (Specialization)

University of Michigan

G9Y6UZHB7MVK

Python 3 Programming (Specialization)

University of Michigan

7LW7W2L6T5HC

Introduction to HTML5

University of Michigan

XVRPLQCGRHM7

Introduction to CSS3

University of Michigan

FWXCD8UREFVY

Statistics

Basic Statistics

University of Amsterdam

YMC6JZQFBTPG

Quantitative Methods

University of Amsterdam

DFLY7NQ2TLKW

Data Science Math Skills

Duke University

KDFY8FU862S7

Linguistics & Psychology

The Bilingual Brain

University of Houston

GNWHC4YT5U3B

Fundamental Neuroscience for Neuroimaging

Johns Hopkins University

DUKZ5M7CM6G9

Introduction to Psychology

University of Toronto

BYGC3ETHDDW6

Big Data and Language 1

Korea Advanced Institute of Science and Technology (KAIST)

HNHFREK7VJ6T

Philosophy and the Sciences: Introduction to the Philosophy of Cognitive Sciences

The University of Edinburgh

WAQRB9MGC9LQ
