About Naki

List of research and engineering of NLP for American Native/Indigenous Languages.

This page tries to assemble all the research on Natural Language Processing (NLP) for the native and indigenous languages of the American continent. Our languages are in danger, especially if they are left out of the new digital boom, which is reaching even the most remote communities. Although scientific and engineering work has been done in the field, much more is necessary to achieve usable tools that can compete with the products of the big companies (such as Google Translate, Alexa, etc.). To push this effort forward, this page aims to provide a list that is as complete as possible.

Our main aim is to encourage native speakers, researchers, and engineers to participate in this effort. Hopefully, we can do so with this survey.

If you want more information, please read our paper: “Challenges of language technologies for the indigenous languages of the Americas”. We also invite you to have a look at our presentation.

Last Update: 22/November/2020

Machine Translation
Automatic Lexical Extraction
Morphological analysis and segmentation
Corpus and digital resources
Speech Recognition
POS Tagging
Parsing
OCR
Spell checking
WordNet
Language ID
Code-Switching and Multilingual NLP
Tools, documentation and education
Computational Linguistic Analysis and Surveys
Contact

Corpus and digital resources

Online Corpus Resources

BriBri Annotated speech + morphology corpus
BriBri bilingual dictionary
Inuktitut Morphological Database
JW300 Multilingual corpus that also includes many indigenous languages of the American continent. (Soon available at OPUS)
Cherokee-English Parallel Corpus
English-Inuktitut Parallel Corpus
Nahuatl-Spanish, Axolotl Parallel Nahuatl - Spanish
Gran diccionario Nahuatl
Wixarika-Spanish Parallel Wixarika - Spanish
Shipibo-Konibo Spanish Parallel corpus.
Shipibo-Konibo Wordnet
Shipibo-Konibo Lemma corpus
Shipibo-Konibo POS-tag corpus
Mapudungun Speech and parallel corpus
Mexican Languages Parallel Corpus
Morphological reinflection (Navajo, Haida and Quechua)
Morphological inflection SIGMORPHON 2020 (Tlatepuzco Chinantec, San Pedro Amuzgos Amuzgo, Yoloxóchitl Mixtec, Chichicapan Zapotec, Yaitepec Chatino, Zenzontepec Chatino, Eastern Highland Chatino, Eastern Highland Otomi, Mezquital Otomi and Chichimec)
Siminchikkunarayku A Speech Corpus for Preservation of Southern Quechua
Tsunkua Spanish-Hñahñu (Otomi) parallel corpus.
Universal Dependencies: Mbya Guarani, Shipibo Konibo, Cusco Quechua
FastText: Nahuatl

Scientific papers

Chiruzzo, L., Amarilla, P., Ríos, A., & Lugo, G. G. (2020, May). Development of a Guarani-Spanish Parallel Corpus. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2629-2633).
Duan, M., Fasola, C., Rallabandi, S. K., Vega, R., Anastasopoulos, A., Levin, L., & Black, A. W. (2020, May). A Resource for Computational Experiments on Mapudungun. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2872-2877).
Martínez, G. S., Montaño, C., Bel-Enguix, G., Córdova, D., & Montoya, M. M. (2020, May). CPLM, a Parallel Corpus for Mexican Languages: Development and Interface. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2947-2952).
Bustamante, G., Oncevay, A., & Zariquiey, R. (2020, May). No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2914-2923).
Frey, B. (2020). “Data is nice:” Theoretical and pedagogical implications of an Eastern Cherokee corpus. LD&C Special Publication.
Duan, M., Fasola, C., Rallabandi, S. K., Vega, R. M., Anastasopoulos, A., Levin, L., & Black, A. W. (2019). A Resource for Computational Experiments on Mapudungun. arXiv preprint arXiv:1912.01772.
Agić, Ž., & Vulić, I. (2019, July). JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. In Proceedings of the 57th Conference of the Association for Computational Linguistics (pp. 3204-3210).
Cynthia Montano, Gerardo Sierra, Gemma Bel-Enguix & Helena Gomez-Adorno (2019). A Mixtec-Spanish Parallel Corpus. WiNLP 2019 Workshop, Florence, Italy.
Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo & Luis Camacho (2018). Siminchik: A Speech Corpus for Preservation of Southern Quechua. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
Katherine M. Schmirler, Antti Arppe, Atticus G. Harrigan & Antti Arppe. (2017). A Morphologically Tagged Corpus for Plains Cree. 4th Prairie Workshop on Language and Linguistics, University of Saskatchewan, March 18, 2017.
Kazeminejad, G., Cowell, A., & Hulden, M. (2017). Creating lexical resources for polysynthetic languages—the case of Arapaho. ComputEL-2, 10. (Arapaho)
Sofia Margarita Flores Solórzano. (2017). Un primer corpus pandialectal oral de la lengua bribri y su anotacion morfológica con base en el modelo de estados finitos. Ph.D. thesis, Universidad Autónoma de Madrid.
Cavar, M., Cavar, D., & Cruz, H. (2016). Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR. In LREC.
Bell, L., & Bell, L. (2017). Work With What You’ve Got. ComputEL-2, 48. (Haida)
Galarreta Asian, A. P. (2017). Generación de corpus paralelos para la implementación de un traductor automático estadístico entre shipibo-konibo y español. Pontificia Universidad Católica del Perú.
Gutierrez-Vasques, X., Sierra, G., & Pompa, I. H. (2016). Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl. In LREC.
Hoenen, A. (2016). Wikipedia Titles As Noun Tag Predictors. In LREC.
Christodouloupoulos, C., & Steedman, M. (2015). A massively parallel corpus: the Bible in 100 languages. Language resources and evaluation, 49(2), 375-395.
Goldhahn, D., Eckart, T., & Quasthoff, U. (2012). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In LREC (pp. 759-765).
Rios, A., Göhring, A., & Volk, M. (2008). A Quechua-Spanish parallel treebank. LOT Occasional Series, 12, 53-64.
Monson, C., Levin, L., Vega, R. M., Brown, R. D., Font-Llitjos, A., Lavie, A., … & Huisca, R. (2004). Data Collection and Analysis of Mapudungun Morphology for Spelling Correction. Computer Science Department, 300.
Martin, J., Johnson, H., Farley, B., & Maclachlan, A. (2003). Aligning and Using an English-Inuktitut Parallel Corpus. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond.

Machine Translation

Online demos and software

CHANA A software platform for automatic translation between Peruvian native languages
Mainumby is an experimental translation app for the Guarani-Spanish language pair.
Microsoft Translator includes Yucatec Maya and Queretaro Otomí.
Wayuu-Spanish Machine Translation Author: José Cirilo González Hernández
Wixarika-Spanish Machine Translation Author: Jesús Manuel Mager Hois
Zapotec-Spanish Translation App. Author: Gonzalo Santiago Martínez.

Scientific papers

Isaac Feldman and Rolando Coto-Solano. (2020). Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri. COLING.
Zhang, S., Frey, B., & Bansal, M. (2020). ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization. EMNLP 2020.
Le, T. N., & Sadat, F. (2020, October). Low-Resource NMT: an Empirical Study on the Effect of Rich Morphological Word Segmentation on Inuktitut. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (AMTA 2020) (pp. 165-172).
Gómez Montoya, H. E. (2019). A crowd-powered conversational assistant for the improvement of a neural machine translation system in native peruvian language. Pontificia Universidad Católica del Perú.
Gasser, M. (2018). Mainumby: un Ayudante para la Traducción Castellano-Guaraní. arXiv preprint arXiv:1810.08603.
Mager, M., Mager, E., Medina-Urrea, A., Ruiz, I. V. M., & Kann, K. (2018). Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 73-83).
Micher, J. (2018). Using the Nunavut Hansard Data for Experiments in Morphological Analysis and Machine Translation. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 65-72).
Mager, M., & Meza, I. (2018). Hacia la traducción automática de las lenguas indígenas de México. In Proceedings of the 2018 Digital Humanities Conference. The Association of Digital Humanities Organizations.
Huang, G., da Silva, T. F., Lamel, L., Gauvain, J. L., Gorin, A., Laurent, A., … & Messouadi, A. (2017, March). An investigation into language model data augmentation for low-resourced STT and KWS. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5790-5794). IEEE. (Guaraní)
Galarreta, A. P., Melgar, A., & Oncevay, A. (2017, September). Corpus Creation and Initial SMT Experiments between Spanish and Shipibo-konibo. In RANLP 2017 (pp. 238-244).
Mager Hois, Jesús Manuel. (2017). Traductor híbrido wixárika - español con escasos recursos bilingües. Universidad Autónoma Metropolitana.
Rios, A. (2016). A basic language technology toolkit for Quechua.
González Hernández, José C. (2016). Herramienta de traducción automática de wayuunaiki a español. Caso de estudio: sintagmas nominales y verbales simples. Universidad de Zulia.
Mager Hois, J. M., Barrón Romero, C., & Meza Ruiz, I. V. (2016). Traductor estadístico wixarika-español usando descomposición morfológica. In COMTEL 2016.
Coler, M., & Homola, P. (2014). Rule-based machine translation for Aymara. In M. Jones (Ed.), Endangered Languages and New Technologies (Chapter 4, pp. 67-80). Cambridge University Press.
Uchamaco, G. R. L., Vilca, H. D. C., & Mariño, F. C. C. (2013). Incubación de Sistema de Traducción Automática Español a Quechua, Basado en la Plataforma Libre y Código Abierto Apertium. Ceprosimad, 2(1), 57-65.
Fernández, D. I., Gamboa, O. Q., Atencia, J. M., & Bedoya, O. E. H. (2013, May). Design and implementation of an “Web API” for the automatic translation Colombia’s language pairs: Spanish-Wayuunaiki case. In Communications and Computing (COLCOM), 2013 IEEE Colombian Conference on (pp. 1-9). IEEE.
Rudnick, A., & Gasser, M. (2013). Lexical selection for hybrid mt with sequence labeling. In Proceedings of the Second Workshop on Hybrid Approaches to Translation (pp. 102-108).
Vilca, H. D. C. (2009). Traductor automático en línea del español a quechua, basado en la plataforma libre y código abierto APERTIUM. Revista de Investigaciones (Puno)-Escuela de Posgrado de la UNA PUNO, 5(3).
Castro Cavero, Indhira. (2007). Traductor morfológico del castellano y quechua. Revista I+i, vol. 1, no. 1.
Gasser, M. (2006). Machine translation and the future of indigenous languages. In I Congreso Internacional de Lenguas y Literaturas Indoamericanas, Temuco, Chile.
Abdelali, A., Cowie, J., Helmreich, S., Jin, W., Milagros, M. P., Ogden, B., … & Zacharski, R. (2006, August). Guarani: a case study in resource development for quick ramp-up MT. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, “Visions for the Future of Machine Translation” (pp. 1-9).
Monson, C., Llitjós, A. F., Aranovich, R., Levin, L., Brown, R., Peterson, E., … & Lavie, A. (2006). Building NLP systems for two resource-scarce indigenous languages: Mapudungun and Quechua. Strategies for developing machine translation for minority languages, 15.
Schafer, C., & Drábek, E. F. (2005, June). Models for Inuktitut-English word alignment. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (pp. 79-82). Association for Computational Linguistics.
Langlais, P., Gotti, F., & Cao, G. (2005, June). Nukti: English-Inuktitut word alignment system description. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (pp. 75-78). Association for Computational Linguistics.
Lopez, A., & Resnik, P. (2005, June). Improved HMM alignment models for languages with scarce resources. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (pp. 83-86). Association for Computational Linguistics.
Martin, J., Mihalcea, R., & Pedersen, T. (2005, June). Word alignment for languages with scarce resources. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (pp. 65-74). Association for Computational Linguistics.
Llitjós, A. F., Aranovich, R., & Levin, L. (2005). Building Machine translation systems for indigenous languages. In Second Conference on the Indigenous Languages of Latin America (CILLA II), Texas, USA.

Automatic Lexical Extraction

Scientific papers and dictionaries

Hunt, B., Chen, E., Schreiner, S. L., & Schwartz, L. (2019). Community lexical access for an endangered polysynthetic language: An electronic dictionary for St. Lawrence Island Yupik. In NAACL-HLT 2019, 122.
Gutierrez-Vasques, X., & Mijangos, V. (2017). Low-resource bilingual lexicon extraction using graph based word embeddings. arXiv preprint arXiv:1710.02569.
Gutierrez, Ximena. (2015, June). Bilingual lexicon extraction for a distant language pair using a small parallel corpus. In NAACL-HLT 2015 Student Research Workshop (SRW) (p. 154).
Lam, K. N., Al Tarouti, F., & Kalita, J. (2014). Creating lexical resources for endangered languages. ACL 2014, 54. (Cherokee and Cheyenne)

Morphological analysis and segmentation

Software

Scientific Papers

Mager, Manuel, Özlem Çetinoğlu, and Katharina Kann. (2020) “Tackling the Low-resource Challenge for Canonical Segmentation.” EMNLP 2020.
Kann, Katharina, McCarthy, Arya D., Nicolai, Garrett, and Hulden, Mans. “The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion”, SIGMORPHON 2020.
Vylomova, E., White, J., Salesky, E., Mielke, S. J., Wu, S., Ponti, E., … & Tyers, F. (2020). SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection. SIGMORPHON 2020, 1.
Klyachko, E., Sorokin, A., Krizhanovskaya, N., Krizhanovsky, A., & Ryazanskaya, G. (2020). LowResourceEval-2019: a shared task on morphological analysis for low-resource languages. arXiv preprint arXiv:2001.11285.
Eskander, R., Callejas, F., Nichols, E., Klavans, J. L., & Muresan, S. (2020, May). MorphAGram, Evaluation and Framework for Unsupervised Morphological Segmentation. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 7112-7122).
Cruz, H., Stump, G., & Anastasopoulos, A. (2020). A Resource for Studying Chatino Verbal Morphology. arXiv preprint arXiv:2004.02083.
Sofía Flores Solórzano. 2019. La modelización de la morfología verbal bribri - modeling the verbal morphology of bribri. Revista de Procesamiento del Lenguaje Natural, 62:85–92. DOI 10.26342/2019-62-10.
Eskander, R., Klavans, J. & Muresan S. (2019) Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 189–195).
Sorokin, A. (2019). Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art? In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 154–159).
Micher, J. (2019). Bootstrapping a Neural Morphological Generator from Morphological Analyzer Output for Inuktitut. In Proceedings of the Workshop on Computational Methods for Endangered Languages (Vol. 2, No. 1, p. 7).
Escobar Farfan, J. I. (2019). Nahuatl contemporary writing: studying convergence in the absence of a written norm (Doctoral dissertation, University of Sheffield).
Chen, Emily, and Lane Schwartz. (2018) A morphological analyzer for St. Lawrence Island/Central Siberian Yupik. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
Cardenas, R. & Zeman, D. (2018) A Morphological Analyzer for Shipibo-Konibo. Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 131-139).
Moeller, S., & Hulden, M. (2018). Automatic Glossing in a Low-Resource Setting for Language Documentation. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 84-93).
Moeller, S., Kazeminejad, G., Cowell, A., & Hulden, M. (2018). A Neural Morphological Analyzer for Arapaho Verbs Learned from a Finite State Transducer. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 12-20).
Littell, P. (2018). Finite-state morphology for Kwak’wala: A phonological approach. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 21-30).
Andriyanets, V., & Tyers, F. (2018). A prototype finite-state morphological analyser for Chukchi. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 31-40).
Kazantseva, A., Maracle, O. B., & Pine, A. (2018). Kawennón: nis: the Wordmaker for Kanyen’kéha. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 53-64).
Kann, K., Mager, M., Meza-Ruiz, I., & Schütze, H. (2018). Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages. 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies NAACL-HLT, New Orleans, Louisiana, US. June, 2018.
Antti Arppe, Christopher Cox, Mans Hulden, Jordan Lachler, Sjur N. Moshagen, Miikka Silfverberg & Trond Trosterud (2017) Computational modeling of the verb in Dene languages. The case of Tsuut’ina. Working Papers in Athabascan Linguistics (“Red Book” series), Alaska Native Language Center, pp. 51-68.
Harrigan, A. G., Schmirler, K., Arppe, A., Antonsen, L., Trosterud, T., & Wolvengrey, A. (2017). Learning from the computational modelling of Plains Cree verbs. Morphology, 27(4), 565-598.
Cotterell, R., Kirov, C., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., & Faruqui, M. (2017). CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages. CoNLL SIGMORPHON, 1. (Haida, Navajo, Quechua)
Bowers, D., Arppe, A., Lachler, J., Moshagen, S., & Trosterud, T. (2017). A Morphological Parser for Odawa. ComputEL-2, 1.
Micher, J. C. (2017). Improving Coverage of an Inuktitut Morphological Analyzer Using a Segmental Recurrent Neural Network. ComputEL-2, 101. (Inuktitut)
Arppe, A., Junker, M. O., & Torkornoo, D. (2017). Converting a comprehensive lexical database into a computational model: The case of East Cree verb inflection. ComputEL-2, 52. (Cree)
Sylak-Glassman, J., Kirov, C., & Yarowsky, D. (2016). Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages. In LREC. (Also includes some experiments with Nahuatl)
Christopher Cox, Måns Huldén, Miikka Silfverberg, Jordan Lachler, Sally Rice, Sjur N. Moshagen, Trond Trosterud & Antti Arppe (2016) Computational modeling of the verb in Dene languages. The case of Tsuut’ina. Dene Languages Conference, Yellowknife, North-West Territories, Canada, 6-7 June 2016.
Snoek, C., Thunder, D., Loo, K., Arppe, A., Lachler, J., Moshagen, S., & Trosterud, T. (2014, June). Modeling the noun morphology of Plains Cree. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages (pp. 34-42). (Cree)
Gonzales, A. R., & Mamani, R. A. C. (2014). Morphological Disambiguation and Text Normalization for Southern Quechua Varieties. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (pp. 39-47).
Dunham, J., Cook, G., & Horner, J. (2014). LingSync & the Online Linguistic Database: New models for the collection and management of data for language communities, linguists and language learners. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages (pp. 24-33).
Assini, A. A. (2013). Natural language processing and the Mohawk language: creating a finite state morphological parser of Mohawk formal nouns. University of Limerick.
Vilca, C., David, H., Mariñó, C., Cagniy, F., & Mamani Calderon, E. F. (2013). Analizador morfológico de la lengua quechua basado en software libre Helsinki Finite-State Transducer (HFST).
Nicholson, J., Cohn, T., & Baldwin, T. (2012). Evaluating a Morphological Analyser of Inuktitut. NAACL HLT 2012, 372.
Martínez-Gil, C., Zempoalteca-Pérez, A., Soancatl-Aguilar, V., de Jesús Estudillo-Ayala, M., Lara-Ramírez, J. E., & Alcántara-Santiago, S. (2012). Computer Systems for Analysis of Nahuatl. Research in Computing Science, 47, 11-16.
Gasser, M. (2011). Computational morphology and the teaching of indigenous languages. In Indigenous Languages of Latin America—Actas del Primer Simposio sobre Enseñanza de Lenguas Indígenas de América Latina (pp. 52-61).
Porta, A. O. (2010, July). The use of formal language models in the typology of the morphology of Amerindian languages. In Proceedings of the ACL 2010 Student Research Workshop (pp. 109-113). Association for Computational Linguistics. (Toba and Quichua)
Medina-Urrea, A. (2008). Affix discovery based on entropy and economy measurements. Computational Linguistics for Less-Studied Languages. Texas Linguistics Society 10, 99-112.
Medina-Urrea, A. (2007). Affix discovery by means of corpora: Experiments for Spanish, Czech, Ralámuli and Chuj. In Aspects of Automatic Text Analysis (pp. 277-299). Springer, Berlin, Heidelberg.
Wolfart, H. C., & Pardo, F. (1973). Computer-assisted linguistic analysis, University of Manitoba Anthropology Papers.

Speech

Cruz H. and Waring J. (2019). Deploying Technology to Save Endangered Languages. arXiv preprint arXiv:1908.08971.
Klavans, J., Morgan, J., LaRocca, S., Micher, J., & Voss, C. (2018). Challenges in Speech Recognition and Translation of High-Value Low-Density Polysynthetic Languages. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Papers) (Vol. 2, pp. 283-293).
Adams, O., Cohn, T., Neubig, G., & Michaud, A. (2017, December). Phonemic transcription of low-resource tonal languages. In Proceedings of the Australasian Language Technology Association Workshop 2017 (pp. 53-60).
Rolando Coto-Solano and Sofía Flores Solórzano. (2017). Comparison of two forced alignment systems for aligning Bribri speech. CLEI Electron. J., 20(1):2–1. DOI: 10.19153/cleiej.20.1.2.
Rolando Coto-Solano and Sofía Flores Solórzano. 2016. Alineación forzada sin entrenamiento para la anotación automática de corpus orales de las lenguas indígenas de Costa Rica. Kánina, 40(4):175–199. DOI 10.15517/RK.V40I4.30234.
Mendels, G., Cooper, E., & Hirschberg, J. (2016). Babler-data collection from the web to support speech recognition and keyword search. In Proceedings of the 10th Web as Corpus Workshop (pp. 72-81).
Maldonado, D. M., Villalba Barrientos, R., & Pinto-Roa, D. P. (2016, November). Eñe’ẽ: Sistema de reconocimiento automático del habla en Guaraní. In Simposio Argentino de Inteligencia Artificial (ASAI 2016)-JAIIO 45 (Tres de Febrero, 2016).
DiCanio, C., Nam, H., Whalen, D. H., Timothy Bunnell, H., Amith, J. D., & García, R. C. (2013). Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment. The Journal of the Acoustical Society of America, 134(3), 2235-2246. (For Mixtec)
Urrea, A. M., Camacho, J. A. H., & García, M. A. Towards the Speech Synthesis of Raramuri: A Unit Selection Approach based on Unsupervised Extraction of Suffix Sequences. Advances in Computational Linguistics, 243.
Nolazco-Flores, J. A., Salgado-Garza, L. R., & Peña-Díaz, M. (2005, June). Speaker dependent ASRs for huastec and western-huastec náhuatl languages. In Iberian Conference on Pattern Recognition and Image Analysis (pp. 595-602). Springer, Berlin, Heidelberg.

POS Tagging

Jose Pereira-Noriega, Rodolfo Mercado-Gonzales, Andrés Melgar, Marco Sobrevilla-Cabezudo, & Arturo Oncevay-Marcos. (2017). Ship-LemmaTagger: building an NLP toolkit for a peruvian native language. In Text, Speech, and Dialogue: 20th International Conference, TSD 2017. Springer.

Parsing

Vasquez, A., Aguirre, R. E., Angulo, C., Miller, J., Villanueva, C., Agić, Ž., … & Oncevay, A. (2018). Toward Universal Dependencies for Shipibo-Konibo. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018) (pp. 151-161).
Homola, P. (2011, September). Parsing a Polysynthetic Language. In RANLP (pp. 562-567). (Aymara)
Agić, Ž., Johannsen, A., Plank, B., Martínez, H. A., Schluter, N., & Søgaard, A. (2016). Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4, 301.

OCR

Maxwell, M., & Bills, A. (2017). Endangered Data for Endangered Languages: Digitizing Print dictionaries. ComputEL-2, 85. (Tzeltal, Muinane, Cubeo)
Garrette, D., & Alpert-Abrams, H. (2016). An Unsupervised Model of Orthographic Variation for Historical Document Transcription. In HLT-NAACL (pp. 467-472).
Hubert, I., Arppe, A., Lachler, J., & Santos, E. A. (2016). Training & quality assessment of an optical character recognition model for Northern Haida. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 3227-3234).

Spell checking

Carlo Alva and Arturo Oncevay-Marcos. (2017). Spell-checking based on syllabification and character-level graphs for a peruvian agglutinative language. In Proceedings of the EMNLP 2017 Workshop on Subword & Character Level Models in NLP, SCLeM 2017. ACL Anthology.
Schwartz, L., & Chen, E. (2017). Liinnaqumalghiit: A Web-based Tool for Addressing Orthographic Transparency in St. Lawrence Island/Central Siberian Yupik. Language documentation and conservation, 11.

WordNet

Valencia, D. M., Oncevay-Marcos, A., & Cabezudo, M. A. S. (2018). WordNet-Shp: Towards the Building of a Lexical Database for a Peruvian Minority Language. In LREC.

Language ID

Espichán-Linares, A., & Oncevay-Marcos, A. (2017, September). Language Identification with Scarce Data: A Case Study from Peru. In Annual International Symposium on Information Management and Big Data (pp. 90-105). Springer, Cham.
Alexandra Espichán-Linares and Arturo Oncevay-Marcos. 2017. A low-resourced peruvian language identification model. In Proceedings of the SIMBig 2017 Track on Applied Natural Language Processing, ANLP 2017. Springer.
Xia, F., Lewis, C., & Lewis, W. D. (2010). The Problems of Language Identification within Hugely Multilingual Data Sets. In LREC.

Code-Switching and Multilingual NLP

Mager, M., Çetinoğlu, Ö., & Kann, K. (2019, June). Subword-Level Language Identification for Intra-Word Code-Switching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 2005-2011).
Dufter, P., Zhao, M., Schmitt, M., Fraser, A., & Schütze, H. (2018). Embedding Learning Through Multilingual Concept Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1520-1530).
Asgari, E., & Schütze, H. (2017). Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 113-124).
Garrette, D., Alpert-Abrams, H., Berg-Kirkpatrick, T., & Klein, D. (2015). Unsupervised Code-Switching for Multilingual Historical Document Transcription. In HLT-NAACL (pp. 1036-1041).
King, B., & Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1110-1119). (Also includes Nahuatl mixed sentences)

Tools, documentation and education

Online available software

Kirrkirr: software for the exploration of indigenous language dictionaries

Papers

Mercado-Gonzales, R., Pereira-Noriega, J., Cabezudo, M. A. S., & Oncevay-Marcos, A. (2018). ChAnot: An intelligent annotation tool for indigenous and highly agglutinative languages in Peru. In LREC.
Flores Solórzano, S. (2012). Teclado chibcha: un software lingüístico para los sistemas de escritura de las lenguas bribri y cabécar. Revista de Filología y Lingüística de la Universidad de Costa Rica Vol. 36 Núm. 2.
Kuhn, J. (2004). Applying computational linguistic techniques in a documentary project for Q’anjob’al (Mayan, Guatemala). In Proceedings of LREC 2004.
Lessard, G., Brinklow, N., & Levison, M. (2018). Natural Language Generation for Polysynthetic Languages: Language Teaching and Learning Software for Kanyen’kéha (Mohawk). In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 41-52).
Sofía Flores Solórzano. (2010). Teclado chibcha: Un software lingüístico para los sistemas de escritura de las lenguas bribri y cabécar. Revista de Filología y Lingüística de la Universidad de Costa Rica, 36(2):155–161. DOI 10.15517/RFL.V36I2.1110. https://revistas.ucr.ac.cr/index.php/filyling/article/view/1110
Manning, C. D., Jansz, K., & Indurkhya, N. (2001). Kirrkirr: Software for browsing and visual exploration of a structured Warlpiri dictionary. Literary and Linguistic Computing, 16(2), 135-151.
Jansz, K., Manning, C., & Indurkhya, N. (1999). Kirrkirr: Interactive visualisation and multimedia from a structured Warlpiri dictionary. Proceedings of AusWeb99, the Fifth Australian World Wide Web Conference, pp. 302-316.

Computational Linguistic Analysis and Surveys

Aguilar, C., & Acosta, O. (2020). A Critical Review of the Current State of Natural Language Processing in Mexico and Chile. Natural Language Processing for Global and Local Business, 365-389.
Schwartz, L., Tyers, F., Levin, L., Kirov, C., Littell, P., Lo, C. K., … & Strunk, L. (2020). Neural Polysynthetic Language Modelling. arXiv preprint arXiv:2005.05477.
Gupta, V., & Boulianne, G. (2020, May). Automatic Transcription Challenges for Inuktitut, a Low-Resource Polysynthetic Language. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 2521-2527).
Vera, J., & Palma, W. (2020). Laplacian spectrum approach to linguistic complexity: a case study on indigenous languages of the Americas. EPL (Europhysics Letters).
van Esch, D., Foley, B., & San, N. (2019). Future Directions in Technological Support for Language Documentation. In Proceedings of the Workshop on Computational Methods for Endangered Languages (Vol. 1, No. 1, p. 3).
Maheshwari, A., Bouscarrat, L., & Cook, P. (2018, May). Towards Language Technology for Mi’kmaq. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
Camacho, L., & Zevallos, R. (2018). Siminchikkunarayku. Fundación Siminchikkunarayku, Pontificia Universidad Católica del Perú.
Mager, M., Gutierrez-Vasques, X., Sierra, G., & Meza, I. (2018). Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 55–69).
Littell, P., Kazantseva, A., Kuhn, R., Pine, A., Arppe, A., Cox, C., & Junker, M. O. (2018). Indigenous language technologies in Canada: Assessment, challenges, and successes. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2620-2632).
Gutierrez-Vasques, X., & Mijangos, V. (2018). Comparing morphological complexity of Spanish, Otomi and Nahuatl. In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing (pp. 30-37).
Klavans, J. L. (2018). Computational Challenges for Polysynthetic Languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 1-11).

Contact

This effort can be completed only with the cooperation of all visitors. If you know of any work in this field, please let me know: push to this repository, send an email to mmager [at] turing.iimas.unam.mx, or visit my personal web page.

How to cite

If you found this information useful for your academic research, please acknowledge its use with a citation:

Mager, M., Gutierrez-Vasques, X., Sierra, G., & Meza-Ruiz, I. (2018, August). Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 55-69). Association for Computational Linguistics.

@InProceedings{C18-1006,
  author = 	"Mager, Manuel
		and Gutierrez-Vasques, Ximena
		and Sierra, Gerardo
		and Meza-Ruiz, Ivan",
  title = 	"Challenges of language technologies for the indigenous languages of the Americas",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"55--69",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1006"
}
