I have a lot of XML files, more than 50,000 of them.
In some of these files, some lines are written like this:
<filename>abc.JPEG<^Lilename>
^L is just one character, but I can't find out what ^L means by searching Google.
When I use cat to print the contents of a file, it is displayed like this:
<filename>abc.JPEG<
ilename>
Anyway, I want to change <filename>abc.JPEG<^Lilename> to <filename>abc.JPEG</filename>.
I already found a command that changes a word in many files:
find . -exec perl -pi -e 's/[find_word]/[change_word]/g' {} \;
But that command doesn't work in my case, because it doesn't recognize the search word when I just type ^L.
How can I change <filename>abc.JPEG<^Lilename> to <filename>abc.JPEG</filename> in many files?
Control-L (represented as ^L) is the "form feed" character. In ASCII, it has decimal value 12 (L is the 12th letter of the alphabet) or hex value 0c.
You can replace it using tools like sed by specifying the hexadecimal escape code, \x0c.
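To see the byte for yourself and try the hex-escape approach, something like the following sketch may help. It assumes GNU sed (the \x0c escape is a GNU extension and may not work in other seds); the scratch directory and the file name sample.xml are made up purely for illustration.

```shell
# Dump a single form feed as hex to confirm its value.
byte=$(printf '\f' | od -An -tx1 | tr -d ' \n')
echo "$byte"    # prints 0c

# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a sample line containing a literal form feed (\014 is octal for 0x0c).
printf '<filename>abc.JPEG<\014ilename>\n' > sample.xml

# Match the form feed by its hex escape and write the correct closing tag.
# The result is only printed here; sample.xml itself is left unchanged.
fixed=$(sed 's/<\x0cilename>/<\/filename>/g' sample.xml)
printf '%s\n' "$fixed"
```

Once the printed output looks right, the same expression can be reused for in-place editing.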
Alternatively, compose ^L directly using the keyboard sequence CTRL+V CTRL+L.
For your specific replacement, match <^Lilename> and replace it with </filename> (the g modifier is added in case there is more than one instance per line).
As Hans-Martin Mosner points out in the comments, it seems that someone used backslashes instead of forward slashes when generating the XML (or possibly ran the whole <filename> section through a Unix-to-Windows converter which was overzealous about slashes). \f is a rarely used escape sequence for a form-feed character, aka U+000C or ^L. So some later step of the pipeline then replaced the \f with literal U+000C characters.
Fortunately, U+000C is an extremely rare character that's unlikely to be found intentionally in any sort of XML. And since only \f would produce this, as opposed to (say) \g or \k, a universal find-and-replace should fix not only </filename> but also </folder>, </file>, or anything else that got mangled.
That's what steeldriver's sed-script does; I'd just make it very slightly more general by replacing every form feed with /f.
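A sketch of that more general substitution, again assuming GNU sed for the \x0c escape. The two sample lines are invented for illustration and are not taken from the OP's files.

```shell
# Two sample lines whose closing tags were mangled into form feeds.
mangled=$(printf '<folder>pics<\014older>\n<filename>abc.JPEG<\014ilename>\n')

# Replace every form feed (0x0c) with "/f", wherever it occurs on a line.
repaired=$(printf '%s\n' "$mangled" | sed 's/\x0c/\/f/g')
printf '%s\n' "$repaired"
```

Because the pattern no longer mentions "ilename", the same command repairs any closing tag that starts with f.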
In sed, s/\x0c/\/f/g means "(s)wap all instances of \x0c (that is, U+000C) to /f, (g)lobally".
\f is the form feed character in Perl. It looks as though these malformed files were created by someone new to both Perl and XML.
Here's a much Perlier fix: run perl -pi through find. It also meets the OP's goal of automating the update of all of the files, unlike the accepted answer with sed, which will only work on one file at a time as it isn't paired with find.
\f can simply be used itself instead of the hexadecimal code \x0c.
I've added -type f to tell find to only return plain files; otherwise find will return . in the list and trigger a warning when you try to edit it, though everything else will still work.
I've also made the regex easier to see by using the x flag, which ignores real whitespace, allowing you to space out the elements of your regex. If you don't like this, the same substitution works without it.
And in the likely case that all the form feed characters are spurious and all should be replaced by /f, you can slim the one-liner down even further.
You don't need to use forward slashes to surround your regex substitution command's elements (s///) in Perl. You can use any symbol. If you choose to use any kind of paired bracket-like symbol, however, you have to use both of them: s[old][new] for instance. Since I'm not using slashes, I don't have to escape any slashes.
As for -i.bkp: perl -pi -e lets you edit in-place -- but if you want extra insurance in case you got your find-and-replace Perl program wrong, you can put in a file extension so that it will make a copy of the original files for you. Here, I've used .bkp.
In the most recent versions of Perl, in-place editing has been updated to be more resilient in case your system suffers a serious problem like power loss or running out of disk space, too. Here's Perl author brian d foy on improved in-place editing in recent Perls.
You should consider using Perl for these kinds of tasks, because it is an extremely powerful yet under-rated general-purpose programming language, one of whose original design goals was to replace sed and awk with something much better.
Perl 5's regex matching capabilities and improved regex syntax far exceed those of sed, awk, and indeed every other programming language apart from Perl 6, making Perl the most sensible choice for both simple and advanced regex manipulations.
To clarify: sed will work OK with find too and you can also use sed -i.bkp to make a backup of each file edited, but as far as I know it doesn't feature the extra resilience in Perl 5.28 and above. It also uses the clunkier and far less powerful traditional UNIX® regex syntax.
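For completeness, a sketch of the sed-with-find variant mentioned above. It assumes GNU sed (for the \x0c escape), and the directory layout and file names are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir sub
printf '<filename>abc.JPEG<\014ilename>\n' > one.xml
printf '<folder>pics<\014older>\n' > sub/two.xml

# "{} +" hands many files to a single sed invocation;
# -i.bkp edits each one in place and keeps a .bkp copy next to it.
find . -type f -name '*.xml' -exec sed -i.bkp 's/\x0c/\/f/g' {} +

cat one.xml sub/two.xml
```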