The proliferation of digital communication technologies has profoundly reshaped linguistic landscapes, giving rise to a distinct digital writing register that now constitutes a substantial component of daily interaction. Unlike traditional written registers governed by established prescriptive norms, the emergent register is characterized by a lack of central regulatory authority, positioning it as a 'wild' register within linguistic inquiry. This characteristic necessitates a critical re-evaluation of conventional methodologies for linguistic analysis and processing. This study proposes and applies a novel methodological framework designed specifically for engaging with this complex linguistic register. We revisit the processes of corpus compilation, preprocessing, encoding, and annotation, aiming to streamline fundamental processing tasks while rigorously preserving the inherent linguistic and orthographic diversity characteristic of digital writing. The proposed approach deliberately maintains variations during initial preprocessing stages, deferring the crucial task of word identity disambiguation to subsequent, more context-aware processing steps. Our methodology commenced with the extraction of a large-scale corpus, designated '1400', comprising one million random tweets sourced from the Persian calendar year 1400. A one-million-token sub-corpus was subsequently isolated to serve as the focus for detailed preprocessing, part-of-speech tagging, and lemmatization. During the normalization and tokenization phases, we implemented two primary strategies to enhance processing simplicity: the systematic removal of elements that introduce conventional orthographic variation (such as the zero-width non-joiner) and the strategic segmentation of multi-component elements possessing independent grammatical identities. For corpus annotation, we introduce 'Tagframe', a novel and flexible framework for tag encoding.
The framework is engineered to facilitate the granular representation of linguistic, paralinguistic, and intra-lexical information (such as clitic structures). Leveraging Tagframe, we developed a standardized tagframe specifically tailored for Persian digital text. This involved a necessary revision of certain traditional grammatical concepts and the inclusion of new categories and layered features representing linguistic, internal structure, and paralinguistic characteristics. Lemmatization was performed utilizing diacritization to resolve homographic ambiguity. The final corpus annotation employed 25 linguistic categories supplemented by 82 linguistic features across three distinct layers: linguistic, intra-lexical, and paralinguistic. Statistical analysis of the annotated corpus, including comparison with other corpora, revealed a notably high frequency of hapax legomena, indicative of the corpus's rich diversity, and a high frequency of verbs, suggesting shared contextual characteristics with spoken language registers. The comprehensive methodology and the resulting annotated corpus presented herein establish a foundational resource and a replicable workflow that can significantly facilitate the creation and processing of similar corpora, thereby enabling further linguistic and computational investigations into this and related language registers.
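As a rough illustration of two ideas mentioned above, the sketch below removes the zero-width non-joiner (U+200C) from a string and computes the share of word types that are hapax legomena. It is not the pipeline used to build the '1400' corpus: the function names `strip_zwnj` and `hapax_ratio` are hypothetical, and the whitespace tokenization in the usage lines is an assumption made only to keep the example short.

```python
from collections import Counter

ZWNJ = "\u200c"  # zero-width non-joiner


def strip_zwnj(text: str) -> str:
    """Delete every zero-width non-joiner so spelling variants collapse."""
    return text.replace(ZWNJ, "")


def hapax_ratio(tokens: list[str]) -> float:
    """Return the fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Usage (assumption: naive whitespace tokenization, for illustration only)
sample = "می\u200cروم خانه می\u200cروم مدرسه"
tokens = strip_zwnj(sample).split()
ratio = hapax_ratio(tokens)
```

Dividing by the number of types rather than the number of tokens is one common convention; a token-based ratio would only need `len(tokens)` as the denominator.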
In this research, we studied the inflectional tense forms of Persian and how they are processed. Because historical approaches dominate the study of the Persian language, we devoted the first two questions of this research to revising some concepts in the area. The first question concerned the categorization of Persian verbs according to regularity. The results showed that Persian verbs fall into three categories: regular, irregular, and alternative. Regulars are an open category whose past tense is formed by adding the regular suffix (-id) to the root and which do not belong to the other categories. Irregulars are a closed category in which there is no systematic relation between roots and past tense forms; they are represented as a list of connected pairs. Alternatives, which are a kind of regular verb, are an open category that does not belong to the irregulars; their roots consist of more than two syllables and end in -ɑn, and they take the suffixes -d and -id to form their past tense. The second question concerned the model that represents the process of tense inflection. The results showed that the present tense of regulars and irregulars, as well as the past tense of most regulars, is produced by retrieving the root and adding the appropriate affixes. On the other hand, the past tense of irregulars and of some high-frequency regulars is produced by retrieving the past stem from memory. The third question concerned the mental mechanisms involved in producing Persian verbs. Steven Pinker claims that the distinction between regulars and irregulars is rooted in two distinct mechanisms: a rule mechanism involved in producing regulars and a rote mechanism involved in producing irregulars. The results of this research confirm Pinker's claim.
To examine these three questions, we used a wide range of evidence, but the main source was an experiment in which we presented participants with the present tense of verbs and asked them to produce the past tense as quickly as possible. We then collected the errors they made and used them as data for our analysis.
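The rule-and-rote distinction described above can be sketched as a lookup with a fallback: a stored past stem is retrieved from memory if one exists, and otherwise the regular suffix -id is attached to the root. This is a toy illustration, not the model evaluated in the study. The names `ROTE_MEMORY` and `past_stem` are hypothetical, the Latin transliterations are simplified, and the single stored entry (ræv → ræft, 'go') is chosen only as an example of an irregular pair.

```python
# Toy "rote" memory: root -> stored past stem (illustrative entry only)
ROTE_MEMORY = {
    "ræv": "ræft",  # 'go': irregular, no systematic root/past relation
}

REGULAR_SUFFIX = "id"


def past_stem(root: str, memory: dict[str, str] = ROTE_MEMORY) -> str:
    """Return a past stem: retrieve it by rote if stored, else apply the rule."""
    stored = memory.get(root)
    if stored is not None:
        return stored             # rote mechanism: retrieval from memory
    return root + REGULAR_SUFFIX  # rule mechanism: root + -id


# Usage
irregular = past_stem("ræv")  # retrieved from memory
regular = past_stem("xær")    # 'buy': built by rule
```

Under this sketch, a high-frequency regular would simply be an extra entry in the memory table, which matches the claim that some regular past stems are retrieved rather than composed.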
2024
2020
2018
2017
Python
Web Crawling
Git
Tkinter
HTML
Persian
English