Data Wrangling
sed + regular expressions
Regular Expressions
- Common patterns
  - .          any single character except newline
  - *          zero or more of the preceding match
  - +          one or more of the preceding match
  - [abc]      any one character of a, b, and c
  - [^abc]     any character excluding a, b, and c
  - \d         any digit
  - \D         any non-digit character
  - \w         any alphanumeric character
  - \W         any non-alphanumeric character
  - {m}        m repetitions
  - {m,n}      m to n repetitions
  - \s         any whitespace
  - \S         any non-whitespace character
  - (RX1|RX2)  either something that matches RX1 or RX2
  - ^          the start of the line
  - $          the end of the line
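A few of these patterns can be tried out with grep -E. This is only a sketch: the sample lines are made up, and because grep -E uses POSIX extended regular expressions, where \d is not defined, a bracket range such as [0-9] stands in for it.

```shell
# Hypothetical sample input, one item per line (made up for illustration).
sample='abc
a1c
hello world
2024-01-17
cat
dog'

# ^ and $ anchor the match to the whole line; . matches exactly one character.
three_chars=$(printf '%s\n' "$sample" | grep -E '^a.c$')

# {m} repeats the preceding item m times; [0-9] is used instead of \d.
dates=$(printf '%s\n' "$sample" | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$')

# (RX1|RX2) matches either alternative.
pets=$(printf '%s\n' "$sample" | grep -E '^(cat|dog)$')

printf '%s\n' "$three_chars" "$dates" "$pets"
```

The first pattern keeps abc and a1c, the second keeps only the date-shaped line, and the third keeps cat and dog.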
sed Basic Usage
sed 's/.*Disconnected from //'
- The command has the form s/REGEX/SUBSTITUTION/
  - s             stands for substitution
  - REGEX         some pattern you want to match
  - SUBSTITUTION  the text you want to substitute for the matching text
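As a minimal sketch of the substitution command in action, the snippet below runs it on a single log line. The line itself is a made-up example (the host name "host" is hypothetical), not real log data.

```shell
# Hypothetical sshd log line, made up for illustration.
line='Jan 17 03:13:00 host sshd[2631]: Disconnected from 46.97.239.16 port 55920 [preauth]'

# Replace everything up to and including "Disconnected from " with nothing.
rest=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')

echo "$rest"   # prints: 46.97.239.16 port 55920 [preauth]
```

An empty SUBSTITUTION deletes the matched text, which is why only the tail of the line survives.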
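- A tricky case:
  Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]
  - Here the user is literally named "Disconnected from"
  - * and + match greedily: .* consumes as much text as it can, so 's/.*Disconnected from //' strips everything through the second "Disconnected from" and the user name is lost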
- Pass -E to avoid putting \ before some special characters
- Capture groups
  - a regex surrounded by parentheses is stored in a numbered capture group
  - the groups are available in the substitution as \1, \2, ...
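The greedy-matching pitfall and the capture-group fix can be sketched on the tricky log line above. The longer regular expression is one plausible way to describe the rest of an sshd "Disconnected from" line; its exact shape is an assumption about the log format, not the only way to write it.

```shell
line='Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth]'

# Greedy .* runs to the LAST "Disconnected from ", so the user name is lost.
greedy=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')

# With -E, parentheses create capture groups without backslashes.
# Group 1: optional "invalid " or "authenticating " prefix
# Group 2: the user name
# Group 3: optional " [preauth]" suffix
# Matching the whole line and replacing it with \2 keeps only the user name.
user=$(printf '%s\n' "$line" |
    sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')

echo "$greedy"   # prints: 46.97.239.16 port 55920 [preauth]
echo "$user"     # prints: Disconnected from
```

Anchoring the pattern with $ and spelling out what follows the user name forces the first .* to stop at the first "Disconnected from", since no other split lets the whole line match.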
Data Wrangling Tools
Small Tools
- sort        sorts its input
- uniq -c     collapses consecutive identical lines into a single line, prefixed with a count of the number of occurrences
- paste -sd,  joins all input lines into one line, separated by the single character given with -d{char} (here a comma)
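These small tools are usually chained into a pipeline. The sketch below counts how often each name occurs in a made-up list and prints the two most frequent ones on a single line. The list is hypothetical, and the trailing "-" after paste is there because some implementations require an explicit stdin operand.

```shell
# Hypothetical list of user names, one per line (made up for illustration).
names='root
admin
root
guest
root
admin'

# uniq only collapses ADJACENT duplicates, so unsorted input is not grouped.
unsorted_groups=$(printf '%s\n' "$names" | uniq -c | wc -l | tr -d ' ')
sorted_groups=$(printf '%s\n' "$names" | sort | uniq -c | wc -l | tr -d ' ')

# Count occurrences, sort by count (highest first), keep the top two,
# strip the count column, and join the names with commas.
top=$(printf '%s\n' "$names" | sort | uniq -c | sort -nr | head -n 2 |
    sed -E 's/^ *[0-9]+ //' | paste -sd, -)

echo "$unsorted_groups $sorted_groups"   # prints: 6 3
echo "$top"                              # prints: root,admin
```

The difference between 6 and 3 groups is the reason sort almost always comes before uniq.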
awk - another editor
- awk '{print $2}' prints the second field of each line; fields are split on whitespace by default, and the delimiter can be specified with -F
- awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
  - specifies a pattern: the first field in the line should be 1, and the second field should match the regular expression
  - wc -l counts the number of lines that match the pattern
- awk as a programming language
  BEGIN { rows = 0 }
  $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
  END { print rows }
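The awk commands above can be run on a small input to see what they select. The data here is made up for illustration: each line holds a count followed by a word, in the shape produced by uniq -c.

```shell
# Hypothetical "count word" lines (made up for illustration).
data='1 cake
1 code
2 cave
1 dog
1 cute'

# -F sets the field delimiter; here fields are separated by ":".
second=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')

# Pattern + action: print the word when the count is 1 and the word
# starts with c and ends with e; wc -l then counts the printed lines.
count=$(printf '%s\n' "$data" |
    awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l | tr -d ' ')

# The same count done inside awk: BEGIN runs before any input,
# END runs after the last line.
rows=$(printf '%s\n' "$data" | awk '
    BEGIN { rows = 0 }
    $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
    END { print rows }')

echo "$second $count $rows"   # prints: b 3 3
```

Because the pattern requires $1 == 1, adding $1 to rows is the same as counting the matching lines, so both approaches give 3 for this input.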
Analyzing data
- bc -l can do basic calculations
- R and gnuplot can also be used for more advanced data analysis and plots
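One common way to use bc -l in a pipeline is to sum a column of numbers: paste turns the lines into a single arithmetic expression and bc evaluates it. This is a sketch only. The helper name sum_lines is hypothetical, and the check with command -v is there because bc is not installed on every minimal system.

```shell
# sum_lines (hypothetical helper): read one number per line on stdin,
# join the lines with "+", and let bc evaluate the resulting expression.
sum_lines() {
    paste -sd+ - | bc -l
}

# Only run the example when bc is actually available.
if command -v bc >/dev/null 2>&1; then
    printf '%s\n' 1 2 3 | sum_lines   # evaluates 1+2+3 and prints 6
fi
```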