How cdec uses UTF-8
For the most part, cdec does not care which character encoding is used, as long as words are separated by spaces (character 0x20). However, some functionality (compound splitting, features that detect non-ASCII characters) requires interpreting the character contents, and in these places cdec assumes that its inputs are encoded in UTF-8. If this limited functionality is not used, other encodings will likely work.
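The distinction can be illustrated with a minimal Python sketch. This is not cdec's code: the function names `split_words` and `has_non_ascii` and the sample sentence are hypothetical, chosen only to show why splitting on 0x20 needs no knowledge of the encoding while a feature that inspects characters does.

```python
def split_words(line: bytes) -> list:
    """Split on the space byte 0x20 only; the bytes are never decoded."""
    return [w for w in line.split(b" ") if w]


def has_non_ascii(word: bytes) -> bool:
    """Hypothetical feature: does the word, read as UTF-8, contain a non-ASCII character?"""
    # Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = word.decode("utf-8")
    return any(ord(ch) > 0x7F for ch in text)


# The same sentence in two encodings (sample text is an assumption).
utf8_line = "der Straßenbahnhof".encode("utf-8")
latin1_line = "der Straßenbahnhof".encode("latin-1")

# Word splitting behaves the same for both encodings.
utf8_words = split_words(utf8_line)
latin1_words = split_words(latin1_line)

# Character-level inspection is only meaningful for the UTF-8 input.
utf8_flags = [has_non_ascii(w) for w in utf8_words]
```

Here both lines split into two words, but calling `has_non_ascii` on the Latin-1 word fails to decode, which mirrors the situation described above: whitespace tokenization tolerates other encodings, while functionality that interprets characters depends on the input being UTF-8.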