I'm working with the .csv
output of this SE data query which looks like this (only with 5022 entries):
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
""id"": 281993,
""title"": ""Netbeans won't open in Ubuntu""
}"
(And it has ^M line
endings between [number], and ""title""). I need it to look like this:
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
I fixed this quite easily in a certain text editor which shall remain nameless, but I wanted to make a script so that I don't have to do it again every time the query is refreshed, and so that others can use it. I used sed:
...
This series of commands works perfectly (although it may well be inefficient; it is just a trial-and-error solution):
# Print the ^M and remove them, write to a new file:
cat -v QueryR* | sed 's/\^M//' > QueryNew
# remove all the other junk:
sed -i 's/{//' QueryNew
sed -i 's/}//' QueryNew
sed -i 's/""//g' QueryNew
sed -i 's/^"//' QueryNew
sed -i '/,/{N;/\n.*title:\s/{s/,\n.*title:\s/,\ /}}' QueryNew
sed -i 's/^\s\+//' QueryNew
sed -i '/^\s*$/d' QueryNew
sed -i 's/^id:\ //' QueryNew
sed -i 's/,\ /,/' QueryNew
sed -i 's/\\//g' QueryNew
So, why doesn't this? Only the ^M and the {} get removed, and everything else is still there.
#!/bin/bash
cat -v QueryR* | sed 's/\^M//' > QueryNew
sed -i '{
s/{//
s/}//
s/""//g
s/^"//
/,/{N;/\n.*title:\s/{s/,\n.*title:\s/,\ /}}
s/^\s\+//
/^\s*$/d
s/^id:\ //
s/,\ /,/
s/\\//g
}' QueryNew
I'm sure my mistake is really obvious...
Using cat -v to turn CR characters into literal ^M sequences seems fundamentally ugly to me - if you need to remove DOS line endings, use dos2unix, tr, or sed 's/\r$//'.

If you insist on using sed, then I suggest you print the bits you do want, rather than trying to delete all the random bits you don't - for example:
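The example itself didn't survive in this copy; one plausible shape (my own sketch, GNU sed assumed, with a here-document standing in for the QueryR* file):

```shell
# Hold the digits from each id line, then fetch the title, join the two
# with a comma, and print only that (-n suppresses everything else).
sed -n '/""id""/    { s/[^0-9]//g; h; }
        /""title""/ { s/^""title"": ""//; s/""\r\{0,1\}$//; H; x; s/\n/,/; p; }' <<'EOF'
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
""id"": 281993,
""title"": ""Netbeans won't open in Ubuntu""
}"
EOF
```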
You could get fancy and roll the quote removal into the key-value extraction by matching zero or more quotes at each end of the value sequence:
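Something like this, perhaps (again my own sketch, GNU sed assumed, here-document in place of the real file):

```shell
# "* matches zero or more quotes at each end of the value, so the quote
# stripping happens during the extraction itself.
sed -n '/id""*:/    { s/[^0-9]//g; h; }
        /title""*:/ { s/.*title""*: "*//; s/"*\r\{0,1\}$//; H; x; s/\n/,/; p; }' <<'EOF'
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
""id"": 281993,
""title"": ""Netbeans won't open in Ubuntu""
}"
EOF
```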
You could get really fancy and emulate paste in sed by first joining pairs of lines on the ,\r$ ending and then matching the key-value pairs multiply (g) and non-greedily.

(Personally I'd favor the KISS approach and use the first one.)
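A sketch of that fancier approach (mine, not the original; GNU sed assumed, and [^"]* plays the role of the non-greedy match since sed has none):

```shell
# Join each line ending in ",CR" with the following line, drop the CR
# and LF, then pull both key-value pairs out in one substitution.
# printf supplies a sample record with the DOS-style ending after the id.
printf '"{\n""id"": 281952,\r\n""title"": ""Flash 11.2 No Longer Supported by Google Play""\n}"\n' |
  sed -E '/,\r$/ { N; s/\r//g; s/\n//; }
          s/.*""id"": ([0-9]+),""title"": ""([^"]*)"".*/\1,\2/
          t
          d'
```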
FWIW, since your input appears to be over-quoted JSON, I'd suggest installing a proper JSON parser such as jq. You can then do something like:
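The command itself is missing here; a sketch of the idea (my own, assuming jq's string-interpolation syntax, with a here-document standing in for the query file):

```shell
# Un-double the CSV-style quotes and strip the stray quote at each record
# boundary, leaving a stream of plain JSON objects for jq to consume.
sed 's/^"//; s/"$//; s/""/"/g' <<'EOF' | jq -r '"\(.id),\(.title)"'
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
"{
""id"": 281993,
""title"": ""Netbeans won't open in Ubuntu""
}"
EOF
```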
which removes the superfluous quotes and then uses jq to extract the fields of interest - note that jq seems to handle the DOS-style line endings, so there's no need to take special steps to remove those. Change to jq '.[]' to dump all the attribute-value pairs. (Credit for inspiration and basic jq syntax taken from Overcoming newlines with grep -o.)

I fixed it thanks to steeldriver & further tinkering. Unrefined but works.
translation:

s/"{//
    Remove "{

s/}"//
    Remove }"

s/^"//
    Remove " from the start of a line

/,\r/{N;/\n.*title.*:\s/{s/,\r\n.*title.*:\s/,\ /}}
    Match ,\r on one line and [whatever]title[whatever]: on the next line, and replace all that with ,

s/""//g
    Remove all the remaining double double quotes

s/^\s\+//
    Remove whitespace from the start of a line

/^\s*$/d
    Remove empty lines

s/^id:\ //
    Remove id: and the space after it

s/\\//g
    Remove backslashes (escape chars for " added to some title fields)

tee "$1"
    Specify an outfile when running the script, for example ./queryclean newquery.csv
This is not exactly answering your question or solving your issue, but to get rid of the unwanted characters you can use tr:
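For example (a sketch - the original command wasn't captured, and tr -d with this character set is my guess; the here-document stands in for your QueryR* file):

```shell
# Delete the braces, quotes and carriage returns in one pass.
tr -d '{}"\r' <<'EOF'
"{
""id"": 281952,
""title"": ""Flash 11.2 No Longer Supported by Google Play""
}"
EOF
```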
and you'll get:
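Assuming tr -d was used to delete the braces, quotes and carriage returns (my reconstruction of the missing listing), the result for the sample records is:

```

id: 281952,
title: Flash 11.2 No Longer Supported by Google Play

id: 281993,
title: Netbeans won't open in Ubuntu

```

with blank lines left where the braces were - so, as noted, this only gets rid of the unwanted characters rather than producing the final one-line-per-record format.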
While the question asks for sed, one could work around sed's issues with Python. This code is compliant with both python2 and python3, so either will work.
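The original code wasn't captured in this copy; a sketch in that spirit (my own reconstruction - the convert name and filenames are hypothetical), using the csv module to undo the doubled quotes and json to parse each record:

```python
#!/usr/bin/env python
# Works under python2 and python3. Each record in the file is one quoted
# CSV field spanning several lines, so csv.reader un-doubles the quotes
# and json then parses the resulting object.
from __future__ import print_function

import csv
import json
import sys


def convert(lines):
    """Yield 'id,title' strings from the over-quoted query CSV."""
    for row in csv.reader(lines):
        if not row or not row[0].strip():
            continue
        obj = json.loads(row[0])
        yield '{0},{1}'.format(obj['id'], obj['title'])


if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for line in convert(f):
            print(line)
```

Run it as python queryclean.py QueryRaw.csv (both names hypothetical); on the two records shown in the question it should print:

```
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
```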
Three more approaches: awk, Perl, and GNU grep with perl compatible regexes plus simple perl:
This is another script written in Ruby. It will retain the commas in titles, so the output can be imported into any spreadsheet program without breaking the columns. After the program is run, the produced output will look like this:
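Neither the script nor its output survived in this copy; below is my own reconstruction (the convert name is hypothetical), using Ruby's csv library both to undo the doubled quotes on input and to quote comma-bearing titles on output:

```ruby
#!/usr/bin/env ruby
# Parse the over-quoted records with the csv library, decode each JSON
# object, and emit real CSV so titles containing commas stay quoted.
require 'csv'
require 'json'

def convert(text)
  CSV.parse(text).map do |row|
    obj = JSON.parse(row[0])
    CSV.generate_line([obj['id'], obj['title']]).chomp
  end
end

puts convert(File.read(ARGV[0])) unless ARGV.empty?
```

For the question's records that is:

```
281952,Flash 11.2 No Longer Supported by Google Play
281993,Netbeans won't open in Ubuntu
```

and a title that itself contains a comma comes out CSV-quoted (e.g. 1,"A, B"), so spreadsheets keep it in one column.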