Conditional random fields for text segmentation by language

Authors

  • Robin Cabeza Ruiz, Universidad de Holguín

DOI:

https://doi.org/10.18046/syt.v15i43.2712

Keywords:

Text segmentation by language, conditional random fields.

Abstract

This article proposes the use of conditional random fields to solve the task of text segmentation by language, treating it as a sequence labeling task. The methodology assumes that a change from one language to another may occur anywhere in a document, that the observations of the system are the words of the text, and that the states are the different languages. Based on the results of the study, we conclude that conditional random fields are a powerful tool for segmenting multilingual texts.
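The sequence labeling formulation described in the abstract can be illustrated with a minimal sketch of linear-chain decoding: each word is an observation, each language is a state, and the Viterbi algorithm picks the highest-scoring language sequence. This is not the author's implementation. The label set, the features, and all weights below are hypothetical, hand-picked values chosen for illustration; a real conditional random field would learn its weights from annotated data.

```python
from typing import Dict, List, Tuple

LABELS = ["en", "es"]  # hypothetical label set: one state per language

# Toy, hand-picked weights (assumptions, not learned from data).
# A trained CRF would estimate these by maximizing conditional likelihood.
EMISSION: Dict[Tuple[str, str], float] = {
    ("suffix:ing", "en"): 2.0,
    ("suffix:ción", "es"): 2.0,
    ("word:the", "en"): 2.0,
    ("word:el", "es"): 2.0,
    ("word:is", "en"): 1.5,
    ("word:es", "es"): 1.5,
}
# Staying in the same language is rewarded; switching is penalized.
TRANSITION = {(a, b): (1.0 if a == b else -1.0) for a in LABELS for b in LABELS}


def features(word: str) -> List[str]:
    """Observation features for one word: its identity and two suffixes."""
    w = word.lower()
    return [f"word:{w}", f"suffix:{w[-3:]}", f"suffix:{w[-4:]}"]


def emission_score(word: str, label: str) -> float:
    """Sum the weights of all features of `word` that fire for `label`."""
    return sum(EMISSION.get((f, label), 0.0) for f in features(word))


def viterbi(words: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for the word sequence."""
    if not words:
        return []
    # best[i][y] = best score of any labeling of words[:i + 1] ending in y
    best = [{y: emission_score(words[0], y) for y in LABELS}]
    back: List[Dict[str, str]] = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITION[(p, y)])
            scores[y] = best[-1][prev] + TRANSITION[(prev, y)] + emission_score(word, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    label = max(LABELS, key=lambda y: best[-1][y])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


def segments(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive words sharing a label into (language, text) segments."""
    grouped: List[Tuple[str, List[str]]] = []
    for w, y in zip(words, labels):
        if grouped and grouped[-1][0] == y:
            grouped[-1][1].append(w)
        else:
            grouped.append((y, [w]))
    return [(y, " ".join(ws)) for y, ws in grouped]


words = "the meeting is starting el informe es largo".split()
print(segments(words, viterbi(words)))
```

With these toy weights, the sample sentence is split into an English segment followed by a Spanish one. Words with no matching feature, such as "informe", take their label from their neighbors through the transition scores, which is what allows a language change to be placed at any word boundary.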

 

Author Biography

  • Robin Cabeza Ruiz, Universidad de Holguín

    Master's degree in Computer-Aided Design from the Universidad de Holguín (Cuba, 2015) and bachelor's degree in Computer Science from the Universidad de Oriente (Cuba, 2017). He is currently a professor of Informatics II and a member of the CAD/CAM Studies Center at the Faculty of Engineering of the Universidad de Holguín. His main research interests are biomechanics and computer-based text segmentation.

Published

2017-12-06

Section

Discussion papers