Valid UTF-8 data (hex:) followed by invalid UTF-8 sequence

OK, this one is a bit geeked out again, but it’s relevant to China. If you’re an American, you could probably go your entire life without ever bumping into codepages, but if your life crosses paths with Asia, you almost certainly will…

While developing a new website and doing our Subversion (version control system) check-in, I started bumping into a very unusual error.

ryan@116843:/spike/public/news/app/webroot/redv1.0/img/menu$ sudo svn up
svn: Valid UTF-8 data (hex:) followed by invalid UTF-8 sequence (hex: b8 b4 bc fe)

Unfortunately, Google didn’t come up with much. The best hit was an October 10th post on the Subversion users mailing list. Basically, the answer is that there’s no answer.

Well, I ran svn up in each child directory of the one causing the problem and eventually tracked the error down through my project’s directory tree. It looks like one of the guys using a Windows system copied a JPEG with a Chinese GBK-encoded filename onto the server. Subversion handles paths internally as UTF-8, so a name in any other encoding trips it up. Everything is best kept in UTF-8.
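Rather than running svn up directory by directory, the hunt can be sketched as a small script that walks the tree and flags any name that is not valid UTF-8. This is a minimal sketch, not something from the original troubleshooting session; the function names is_valid_utf8 and find_bad_names are made up for illustration.

```python
import os

def is_valid_utf8(name):
    """Return True if the raw file name bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def find_bad_names(root):
    """Yield (directory, name) for each entry under root whose name is not UTF-8.

    root must be bytes: os.walk then returns raw bytes names, so nothing is
    decoded (or silently repaired) before we get to inspect it.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield dirpath, name

# Hypothetical usage from the top of a working copy:
#   for dirpath, name in find_bad_names(b"."):
#       print(repr(os.path.join(dirpath, name)))
```

Working on bytes is the design choice that matters here: a name that cannot be typed or displayed can still be handled exactly as it is stored on disk.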

Once you find the right file, you have to figure out how to delete a file with a name that can’t be typed…

ryan@116843:/spike/public/news/app/webroot/redv1.0/img/menu$ ls
logo02.jpg       ???? logo.jpg  menu_acc_down.jpg      menu_home_down.jpg  menu_work_down.jpg
logo03.jpg       logo.jpg       menu_acc.jpg           menu_home.jpg       menu_work.jpg
logo04.jpg       logo_top1.jpg  menu_cameras_down.jpg  menu_len_down.jpg
logo05.jpg       logo_top2.jpg  menu_cameras.jpg       menu_len.jpg
logo06.jpg       logo_top3.jpg  menu_gall_down.jpg     menu_tech_down.jpg
logo_bottom.jpg  logo_top.jpg   menu_gall.jpg          menu_tech.jpg

In this case, I just used: rm *\ logo.jpg since there was only one file matching this pattern… Next, I could commit again!
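The glob trick works because the shell never needs the name typed out. The same idea can be sketched in Python by listing the directory as bytes and insisting on exactly one match before deleting anything. The helper names pick_single_match and remove_bad_file are hypothetical, and the dry_run default is an added safety assumption.

```python
import os

def pick_single_match(names, suffix):
    """Return the one non-UTF-8 name ending in suffix; raise if there isn't exactly one."""
    matches = []
    for name in names:
        if not name.endswith(suffix):
            continue
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            # Only names that fail to decode are candidates for removal.
            matches.append(name)
    if len(matches) != 1:
        raise ValueError("expected exactly one match, found %d" % len(matches))
    return matches[0]

def remove_bad_file(directory, suffix, dry_run=True):
    """Locate the offending file in directory (bytes path) and optionally delete it."""
    target = pick_single_match(os.listdir(directory), suffix)
    path = os.path.join(directory, target)
    if not dry_run:
        os.remove(path)
    return path

# Hypothetical usage, mirroring "rm *\ logo.jpg":
#   remove_bad_file(b".", b" logo.jpg", dry_run=False)
```

Refusing to act on zero or several matches guards against the case where the pattern is broader than expected.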

ryan@116843:/spike$ sudo svn up
D    public/.htaccess
Updated to revision 38.

Comments (27)

Barry Hunter, May 4th, 2007 at 1:24 am

Thanks, that was just my problem!

(no other results suggested filename issues :( )

I wrote a tiny script to enumerate through the directories, outputting each path and then running ‘svn status’ on it to find the culprit. (I found no way to get svn to say which folder it was about to try before doing it, so that the path would show before the ‘helpful’ error message.)
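Barry's script isn't shown, but the approach he describes might look roughly like the sketch below. It assumes that svn exits with a non-zero status when it hits the encoding error; the function names and the injectable check parameter are illustrative only.

```python
import os
import subprocess

def svn_status_ok(directory):
    """Run a non-recursive 'svn status' on one directory; True if svn exits cleanly."""
    # Assumption: svn returns a non-zero exit code when the UTF-8 error occurs.
    result = subprocess.run(["svn", "status", "-N", directory],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode == 0

def find_failing_dirs(root, check=svn_status_ok):
    """Walk root, print each directory before checking it, and collect the failures."""
    failing = []
    for dirpath, dirnames, _ in os.walk(root):
        # Skip Subversion's own metadata directories.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        # repr() keeps the print safe even if a directory name is undecodable.
        print(repr(dirpath), flush=True)
        if not check(dirpath):
            failing.append(dirpath)
    return failing
```

Printing the path before each check means the last path shown is the one svn was working on when the error appeared.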

Carl-Erik, May 4th, 2007 at 2:59 pm

Thanks! I also ran into this problem, and could not see anyone coming up with a solution. I actually thought the problem lay within the files, so that deleting the ones making trouble would fix it. Good thing I spotted your blog first :-)

Tom L, May 9th, 2007 at 2:27 pm

Thanks for this post. You saved me quite a bit of time.

Owen, June 29th, 2008 at 10:50 am

Thanks, you saved me a bit of time. You can run into problems importing invalid UTF-8 too; I was importing a WordPress sitemap plugin that gave me problems. The last filename printed before the error was the folder containing the invalid files.

Hope this helps someone.

Aurore, October 1st, 2008 at 6:45 am

I’ve just had this problem. I forced UTF-8 encoding on all the files I had changed last, and it works!

Brian Mark, November 3rd, 2008 at 11:11 am

Too bad you’ve got a lot of spam on here, but the answer here was perfect. Thanks.

ryan, November 3rd, 2008 at 11:33 am

“Too bad you’ve got a lot of spam on here, but the answer here was perfect. Thanks.”

I’ve just gone through and cleaned up some of the remaining spam, but I must admit that I have no clue where some of it comes from… To submit a comment you should be required to fill out the reCAPTCHA field, but it seems that some spammers have found a way around this. Since installing reCAPTCHA the spam has slowed down dramatically, but it does still arrive and the rate does seem to be increasing again.

Maybe there’s a backdoor in reCAPTCHA?

evdsande, November 11th, 2008 at 5:21 am

I ran into the same error, but I cannot find the file causing it. I now even get the same error on all my repositories, including the ones that were not affected. I still don’t have a clue how to recover my repositories. I already removed the malicious project from the affected repository using svnfilter, with no effect. Is there a way to rebuild the repository from the dump and identify the malicious directory or file?

Regards Eric

ryan, November 11th, 2008 at 5:26 am

@Eric-

Your best bet is to go through each of the subdirectories of your project and do an “svn up” on the directory one at a time until you find the subdirectory(s) containing the file(s) that have an incompatible encoding.

Another useful tool is the Unix “file” command. Just run “file *” in the directory that you locate with the incompatible encoding and search for the file that comes back to you with an encoding type other than UTF-8.

Best of luck-

-R

evdsande, November 11th, 2008 at 6:28 am

Hi Ryan,

I did the exercise as you suggested, but I don’t see any UTF-8 encoded types. The file * command gives me “ASCII C++ program text, with CRLF line terminators”, “XML document text” etc… but no UTF-8. Besides, when I ran into this trouble I removed the latest directory that created this error from my repository. My current repository contains only those parts that were there before the problem popped up, yet the problem still exists. It even pops up on repositories that were never touched. I switched to another svn client, tried the command line and even created a new repository to test; all of them give me the same error, and now I’m totally puzzled and stuck with a messed-up Subversion installation.

This is the part of the TortoiseSVN log where the error popped up for the first time.

Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc
Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider.idl
Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider_i.c
Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider_h.h
Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\Interop.BugTraqProvider.dll
Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\issue-tracker-plugins.txt
Error : Valid UTF-8 data
Error : (hex:)
Error : followed by invalid UTF-8 sequence
Error : (hex: c0 a4 01)
Finished! : 52 kBytes transferred in 0 minute(s) and 2 second(s)

I removed the directories and files added here using svnadmin dump and svnfilter, however in the new repository the error still persists (I removed most of the revisions that handled the above actions, but not all of them could be removed). What I cannot understand is, how this can affect a different repository?

ryan, November 11th, 2008 at 7:05 am

@Eric-

I understand your frustration buddy. Just trying to help. You’re not looking for files that ARE UTF-8. You’re looking for files that ARE NOT UTF-8.

In your case, just run:

cd C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin
svn up inc
svn up …

(where “…” is another folder under TortoiseRedminePlugin)

You’ll notice that this will run successfully for most of your files, but there will be some that don’t complete successfully. When you locate the folder where the error starts, then you run “svn up” one file at a time until you find the individual file(s) that are causing the problem. Delete those files from the repository and you’ll be all fixed up.

As for the UTF-8 issue… Again – the key is to find files that are in encodings OTHER THAN UTF-8. You’ve got such a file – it’s just a matter of finding it.

evdsande, November 11th, 2008 at 7:41 am

Hi Ryan,

Sorry if I sounded offensive. It’s Subversion that’s frustrating me, giving me no decent clue where to look. I appreciate your help very much. Thanks a lot! I will go through my files again, but what about the project I removed from the repository? Should I add it to the repository again so I can go through all the files in there?

evdsande, November 11th, 2008 at 12:30 pm

Finally I rebuilt Subversion with a patch that showed me the directory that was causing the trouble (found the patch at: http://www.nabble.com/-PATCH–Issue–2748:-non-UTF-8-filenames-in-the-repository-td19531299.html); however, it hasn’t helped me yet. The error I get now is:

Adding D:\Projects\t\TextDocument.txt
Error Error converting entry in directory
Error ‘/dtm/home/svn/svn/t/db/transactions/0-0.txn’ to UTF8
Error Valid UTF-8 data
Error (hex:)
Error followed by an invalid UTF-8 sequence
Error (hex: c0 a4 01)

Now there are at least 2 things I don’t get:
1: The directory it points to is a Subversion repository transaction file, not the project file I’m trying to import.
2: This is a new repository; I deleted all my old repositories and did a fresh import of a single text file edited with vi.

And I still get the same error. I’m totally puzzled; if you have any clue,

Please….

Thankz Eric

Agris, February 11th, 2009 at 1:20 am

Yes, filenames with non-latin chars cause the problem. Deleting them fixes the problem.

cooper, March 7th, 2009 at 2:42 pm

Running strace svn status will give you the name of the offending file. Unfortunately, svn cares about the names of all files in its directories, even those that are not under version control.

ryan, June 23rd, 2009 at 9:23 pm

Evdsande-

For some reason I never got notified about this comment.

The issue as reported above applies certainly to Mac OS X, Linux and other Unix systems, but I haven’t used Windows for more than a few minutes in the entire 21st Century, so I’m of limited help here.

The error does look the same though, so it is most likely one of the files under your “TortoiseRedminePlugin” folder. I would go through the sub-folders one at a time (first to “inc”) and then to the other sub-folders and do the commit one by one.

One of the files should have the encoding issue as described. Delete that file and the commit should proceed properly.

Best-

-R

zombat, August 2nd, 2009 at 11:57 am

Great post, I found it via Google. My problem was almost identical. An image got copied into a checked out tree, and svn update began failing due to a wacky character in the image name.

Thanks for taking the time to post this! It’s always a great feeling to find a solution to a strange issue that actually works.

jo, October 8th, 2009 at 12:43 pm

also happens for files that are not part of SVN!

my program spits out a log file each run — an error in the filename string made a bunch of log files with garbled file names. i got the same error as others (valid UTF-8 followed by invalid UTF-8), even though none of these log files were ever checked into the repo!

Dom, October 28th, 2009 at 1:50 am

Just joining in with the thanks!

Top result on Google for svn “followed by invalid UTF-8 sequence”, you should be proud :)

Carlos Barros, December 1st, 2009 at 4:20 am

Thanks for this post, it saved me some time. I just want to add my two cents. Sometimes finding the exact file that’s causing the problem is tough. I have an images directory with 2k files, and one of them has this problem.

svn: Error converting entry in directory 'images/thumbnails' to UTF-8
svn: Valid UTF-8 data
(hex: 4e 6f 6b 69 61 2d 35 35 33 30 2d)
followed by invalid UTF-8 sequence
(hex: 96 2d 58 70)

So this told me the directory was images/thumbnails. To find which file, I did:

$ printf "\x4e\x6f\x6b\x69\x61\x2d\x35\x35\x33\x30\x2d\n"
Nokia-5530-

So this told me the filename starts with Nokia-5530- :)
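The same decoding can be done on the whole error message at once. The sketch below pulls both “(hex: …)” groups out of svn's message and returns the readable prefix plus the raw bytes that follow it; parse_svn_hex is a made-up helper name, and the regular expression assumes the message layout quoted in this thread.

```python
import re

def parse_svn_hex(message):
    """Return (valid_prefix, invalid_bytes) from svn's two '(hex: ...)' groups."""
    groups = re.findall(r"\(hex:([0-9a-fA-F ]*)\)", message)
    if len(groups) != 2:
        raise ValueError("expected two '(hex: ...)' groups in the message")
    # bytes.fromhex tolerates the spaces between byte pairs.
    valid, invalid = (bytes.fromhex(g.strip()) for g in groups)
    # The first group is, by svn's own claim, valid UTF-8, so decoding is safe.
    return valid.decode("utf-8"), invalid

message = """svn: Valid UTF-8 data
(hex: 4e 6f 6b 69 61 2d 35 35 33 30 2d)
followed by invalid UTF-8 sequence
(hex: 96 2d 58 70)"""

prefix, rest = parse_svn_hex(message)
print(prefix)  # Nokia-5530-
print(rest)    # b'\x96-Xp'
```

The trailing bytes are left undecoded on purpose, since their encoding has to be guessed; a legacy Windows codepage is one plausible candidate for a stray 0x96.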

Hope this helps

Carlos

Joni, May 26th, 2010 at 9:37 pm

thanks, solved my problem :)

Arjuna, June 30th, 2010 at 7:19 am

In case it helps somebody, I wrote a quick article on how to convert the files without needing to delete them: http://arjuna.deltoso.net/articles/subversion-messy-encoding-valid-utf-8-data-followed-by-invalid-utf-8-sequence/ In any case, thanks Ryan for the time you spent writing your article, which helped me partly solve the problem. Arjuna
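The linked article is not reproduced here, so the following is only a guess at what converting instead of deleting could look like: re-encode each offending name from a presumed source encoding to UTF-8 and build a rename plan. The gbk default matches the file in the original post and is an assumption for any other case; utf8_name and plan_renames are hypothetical names.

```python
import os

def utf8_name(raw, source_encoding="gbk"):
    """Re-encode a raw file name to UTF-8; names already valid UTF-8 pass through."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: the name was written in source_encoding.
        return raw.decode(source_encoding).encode("utf-8")

def plan_renames(directory, source_encoding="gbk"):
    """Return (old_path, new_path) pairs for names in directory (bytes) needing conversion."""
    plan = []
    for raw in os.listdir(directory):
        fixed = utf8_name(raw, source_encoding)
        if fixed != raw:
            plan.append((os.path.join(directory, raw),
                         os.path.join(directory, fixed)))
    return plan

# Hypothetical usage: review the plan first, then apply it.
#   for old, new in plan_renames(b"."):
#       os.rename(old, new)
```

Building the plan separately from applying it leaves room to check the decoded names by eye, which matters because a wrong encoding guess can still decode without error and produce nonsense.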

GloW, July 13th, 2010 at 9:06 pm

thx, solved my problem

vadlib, August 20th, 2010 at 11:35 pm

All I did was find the revision that was giving me the error. Once I found it, I saw that the comments contained some unreadable characters, as if some genius had cut and pasted from a Word doc into the file. I replaced them with the correct characters and then everything worked.
