Machine learning for text classification in classical cryptography

dc.contributor.author: Foxon, Floe
dc.contributor.editor: Antal, Eugen
dc.contributor.editor: Marák, Pavol
dc.date.accessioned: 2025-05-16T13:05:23Z
dc.date.available: 2025-05-16T13:05:23Z
dc.date.issued: 2025
dc.description.abstract: This study furthers previous work on text classification to distinguish between ciphertext and gibberish. The statistical/linguistic properties of four text types were studied: meaningful English text, and three gibberish types (n=1,250 each; total N=5,000). Dimension reduction techniques (PCA, t-SNE, and UMAP) were used to reduce the statistical/linguistic feature space of the texts to two dimensions, revealing distinct regions of (lower dimensional) feature space occupied by each text, with some overlap. Machine learning models including random forests, neural networks (NNs), and support vector machines (SVMs) were used to classify the four text types based on their statistical/linguistic properties. Nested cross-validation revealed better generalization performance for the NNs and SVMs, classifying texts with >90% accuracy. Applied to the Dorabella cryptogram, the models suggest that this text resembles meaningful English text more closely than gibberish types, which comports with the Dorabella cryptogram as a monoalphabetic substitution cipher, but this classification should be interpreted with caution. Features that better separate meaningful English from English-like gibberish are needed, and other encryption schemes/cryptograms should be explored with these methods.
dc.identifier.issn: 1736-6305
dc.identifier.uri: https://hdl.handle.net/10062/109745
dc.language.iso: en
dc.publisher: Tartu University Library
dc.relation.ispartofseries: NEALT Proceedings Series 58
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Machine learning
dc.subject: Neural network
dc.subject: Support vector machine
dc.subject: Random forest
dc.subject: Classification
dc.subject: Dimension reduction
dc.subject: Cryptogram
dc.subject: Dorabella
dc.subject: Substitution cipher
dc.title: Machine learning for text classification in classical cryptography
dc.type: Article
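
The abstract reports that generalization performance was estimated with nested cross-validation. The following is a minimal sketch of that general idea, not the article's actual pipeline: an inner loop tunes a hyperparameter using only the outer-training rows, and the outer loop scores the tuned model on rows it has never seen. Everything specific here is an assumption made for illustration: the synthetic two-feature data, the two labels, the fold counts, and the use of a small k-nearest-neighbours classifier in place of the random forests, NNs, and SVMs named in the abstract.

```python
import random
from collections import Counter

def k_folds(n_items, n_folds):
    """Split indices 0..n_items-1 into n_folds interleaved folds."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def knn_predict(train, point, k):
    """Return the majority label among the k training rows nearest to point."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test rows whose label the k-NN classifier predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, fold):
    """Return (training rows, held-out rows) for one fold of indices."""
    held = set(fold)
    train = [row for i, row in enumerate(data) if i not in held]
    test = [data[i] for i in fold]
    return train, test

def cross_val_accuracy(data, k, n_folds):
    """Plain cross-validation: mean accuracy over the folds for a fixed k."""
    scores = []
    for fold in k_folds(len(data), n_folds):
        train, test = split(data, fold)
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)

def nested_cv(data, candidate_ks, outer_folds=5, inner_folds=3):
    """Nested cross-validation: tune k in the inner loop, score in the outer loop."""
    outer_scores = []
    for fold in k_folds(len(data), outer_folds):
        outer_train, outer_test = split(data, fold)
        # Inner loop: choose k using only the outer-training rows,
        # so the outer test rows never influence model selection.
        best_k = max(
            candidate_ks,
            key=lambda k: cross_val_accuracy(outer_train, k, inner_folds),
        )
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)

# Hypothetical stand-in data: two well-separated clusters of 2-D "features".
rng = random.Random(0)
data = [
    ((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label)
    for label, centre in (("english", 0.0), ("gibberish", 4.0))
    for _ in range(30)
]
rng.shuffle(data)

estimate = nested_cv(data, candidate_ks=[1, 3, 5])
print(f"nested CV accuracy estimate: {estimate:.2f}")
```

Because model selection happens inside each outer fold, the outer score is not inflated by tuning on the test rows, which is the usual reason for preferring nested over plain cross-validation when reporting generalization performance.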

Files

Original bundle

Name: 7.pdf
Size: 1.59 MB
Format: Adobe Portable Document Format