RocketGit

dleucas / wmmsdb (public) (License: GPLv3) (since 2018-07-08) (hash sha1)

A collection of scripts to download, transform and normalize the Watkins Marine Mammal Sound Database.

Credit:

“Watkins Marine Mammal Sound Database, Woods Hole Oceanographic Institution.”

http://cis.whoi.edu/science/B/whalesounds/index.cfm

Clone URLs: https://rocketgit.com/user/dleucas/wmmsdb ssh://rocketgit@ssh.rocketgit.com/user/dleucas/wmmsdb git://git.rocketgit.com/user/dleucas/wmmsdb

master species_names

List of commits:

Subject	Hash	Author	Date (UTC)
code formating	7457d6f0da9e12ac64fcc6b293d23682a69b45c5	dleucas	2018-07-12 00:34:57
add links, add coverage task	c64feb14912a2f7ae7f8e9bcda5c17ca172cadc0	dleucas	2018-07-08 22:40:04
WIP signal class	af311640c00d17c010680c69112fbaefd799176a	dleucas	2018-07-08 22:39:17
add signal cut size and quality	9174f89773c2b6e934bffb3c2ab329c73350b025	dleucas	2018-07-08 22:06:35
add sample rate and number of channels	c32ad9d8dc8b2e461e4869c60d9cee132047f4b1	dleucas	2018-07-08 21:11:39
as markdown	d9fa587724fd2a46bf3373a8d549c4a39746a3aa	dleucas	2018-07-08 00:54:09
add info on cue values	8e60832b65879f8c11a794715fa51bb2a0d2488b	dleucas	2018-07-08 00:53:26
WIP: postion transform, cue and time done	82dde8774e24b1c582a4ac222758e9b98da28961	dleucas	2018-07-08 00:51:45
initial jq transform filter	a688a9ec40d59a7de841eb86e1eeca72c64664b2	dleucas	2018-07-07 23:48:50
add ignores	f8b235e00a265d3c4a6613e3174018ef754f2407	dleucas	2018-07-07 22:06:53
rename to markdown	8f99257ba05ee39b629b7d2281c149bbbe941b29	dleucas	2018-07-07 22:06:15
markdown	0a1aa88a79caad45e62f086716592e00b51ff36e	dleucas	2018-07-07 22:06:03
progress	65700e843b38245a91b97658f63d640c70cefe6f	dleucas	2018-07-07 02:30:36
transform HTML table to JSON object	058f59f35e7e40c437609d91889ceb5a786e005b	dleucas	2018-07-07 02:29:29
initial data survey	7a0ba30602c78bf1f06a19d4322e503e3d11e050	dleucas	2018-07-07 02:28:12
progress information	90dca6f739de3887394810e419457412cb2fd9fa	dleucas	2018-07-06 01:42:38
working download of metadata pages	071d8ad292bcf1486ce55f85eb0bc2102ab9c09a	dleucas	2018-07-06 01:42:06
download metadata pages, initial script	ed63b4125deec37961e76b0f45592ad8549483d4	dleucas	2018-07-06 00:12:09

Commit 7457d6f0da9e12ac64fcc6b293d23682a69b45c5 - code formating
Author: dleucas
Author date (UTC): 2018-07-12 00:34
Committer name: dleucas
Committer date (UTC): 2018-07-12 00:34
Parent(s): c64feb14912a2f7ae7f8e9bcda5c17ca172cadc0
Signer:
Signing key:
Signing status: N
Tree: 11cf96a3ac5b6415b8bed94f98f3b0331a6e2515

File	Lines added	Lines deleted
transform.jq	70	69

File transform.jq changed (mode: 100644) (index 341859d..45aa5be)
2	2	# Source data combines multiple values into one field, so split that up	# Source data combines multiple values into one field, so split that up
3	3	# also use native data types if possible.	# also use native data types if possible.
4	4
5		{
6		# record number is unique, use as _id
7		_id: .RN,
8		# object contains properties of the captured signal
9		signal: {
10		# create a list of JSON objects and add them together
	5		{
	6		# record number is unique, use as _id
	7		_id: .RN,
	8		# object contains properties of the captured signal
	9		signal: {
	10		# create a list of JSON objects and add them together
11	11
12		# Cue field contains 3 values describing the postion on tape
13		# Example input from the docu
14		# 542 B2:8 8.130
15		# 1:03:12 B2:8 8.130
16		# however, following formats are also found
17		# 0:00:00 B30:00 10:20.602
18		# 995 B11:28.497 5:20.426
19		# 96 B4.00 1.525
20		# 93 B23.7 9.164
21		# 93 B3:00 2:13.828
22		# 01:52:52:04
23		# 09:11:00 20:00 951.50
24		# 0 B2:00:00
25		position: [
26		# keep the source string as reference?
27		{_source_string: .CU},
	12		# Cue field contains 3 values describing the postion on tape
	13		# Example input from the docu
	14		# 542 B2:8 8.130
	15		# 1:03:12 B2:8 8.130
	16		# however, following formats are also found
	17		# 0:00:00 B30:00 10:20.602
	18		# 995 B11:28.497 5:20.426
	19		# 96 B4.00 1.525
	20		# 93 B23.7 9.164
	21		# 93 B3:00 2:13.828
	22		# 01:52:52:04
	23		# 09:11:00 20:00 951.50
	24		# 0 B2:00:00
	25		position: [
	26		# keep the source string as reference?
	27		{_source_string: .CU},
28	28
29		# "cue" as in a first matched single integer,
30		# without dot or colon followed by space or end of string
31		# do not use \b because of the colon in 00:00 values
32		(.CU \| capture( "(?<cue>^\\d+(\\s\|$))" ) \| {cue: .cue\|tonumber } ),
	29		# "cue" as in a first matched single integer,
	30		# without dot or colon followed by space or end of string
	31		# do not use \b because of the colon in 00:00 values
	32		(.CU \| capture( "(?<c>^\\d+(\\s\|$))" ) \| {cue: .c\|tonumber } ),
33	33
34		# "time" as in first matched integer with 2 or 3 colons
35		# followed by space or end of string
36		(.CU \| capture( "(?<time>^\\d+:\\d+:\\d+(:\\d+)?(\\s\|$))" ) ),
	34		# "time" as in first matched integer with 2 or 3 colons
	35		# followed by space or end of string
	36		(.CU \| capture( "(?<time>^\\d+:\\d+:\\d+(:\\d+)?(\\s\|$))" ) ),
37	37
38		# buffer size, B followed by integer with colon or dot, also remove B prefix
39		# TODO match 2 colon version
40		(.CU \| capture("(?<analyzer_buffer_size>(?<=B)\\d+[:\\.]\\d+(\\.\\d+)?)") )
41		] \| add,
42		# cut size
43		# 3.36
44		# 9.411
45		# 16.564
46		# 20.35
47		# etc
48		# only 210 records use a different format, ignored for now
49		# 2:00.000
50		# 1:00.030
51		# 10:25.540
52		# 1:25.158
53		# etc.
54		cut_size: (
55		# skip empty records, or with a colon
56		if (.CS \| contains(":") or (length == 0)) then
57		empty
58		else
59		# cast as number and handle a few remaining badly formated records like "0.2.95"
60		(try (.CS \| tonumber) catch empty)
61		end
62		),
63		# any digit in the signal class indicates quality
64		# it's only been used 121 times
65		quality: ( .SC \| capture("(?<q>\\d+)") \| .q \| tonumber )
66		#_source_string_sc: .SC,
67		#class: (
	38		# buffer size, B followed by integer with colon or dot,
	39		# also remove B prefix
	40		# TODO match 2 colon version
	41		(.CU \| capture("(?<analyzer_buffer_size>(?<=B)\\d+[:\\.]\\d+(\\.\\d+)?)") )
	42		] \| add,
	43		# cut size
	44		# 3.36
	45		# 9.411
	46		# 16.564
	47		# 20.35
	48		# etc
	49		# only 210 records use a different format, ignored for now
	50		# 2:00.000
	51		# 1:00.030
	52		# 10:25.540
	53		# 1:25.158
	54		# etc.
	55		cut_size: (
	56		# skip empty records, or with a colon
	57		if (.CS \| contains(":") or (length == 0)) then
	58		empty
	59		else
	60		# cast as number and handle a few remaining badly formated
	61		# records like "0.2.95"
	62		(try (.CS \| tonumber) catch empty)
	63		end
	64		),
	65		# any digit in the signal class indicates quality
	66		# it's only been used 121 times
	67		quality: ( .SC \| capture("(?<q>\\d+)") \| .q \| tonumber )
	68		#_source_string_sc: .SC,
	69		#class: (
68	70	#if ( .SC == "M") then "Mimic" else empty end	#if ( .SC == "M") then "Mimic" else empty end
69	71	# elif ( ($SC \| contains("M")) or ($SC == "M")) then "Mimic"	# elif ( ($SC \| contains("M")) or ($SC == "M")) then "Mimic"
70	72	#elif ( .SC \| contains("V") or .SC == "V") then "Variant"	#elif ( .SC \| contains("V") or .SC == "V") then "Variant"
71	73	#elif ( .SC \| contains("D") or .SC == "D") then "Deletion"	#elif ( .SC \| contains("D") or .SC == "D") then "Deletion"
72	74	#elif ( .SC \| contains("U") or .SC == "U") then "Uncharacteristic"	#elif ( .SC \| contains("U") or .SC == "U") then "Uncharacteristic"
73	75	#elif ( .SC \| contains("C") or .SC == "C") then "Calf"	#elif ( .SC \| contains("C") or .SC == "C") then "Calf"
74		#)
75		},
76		sound: {
	76		#)
	77		},
	78		sound: {
77	79	# plain sample rate as number, however not normalized in digit length	# plain sample rate as number, however not normalized in digit length
78	80	# remove dot or colon, and ignore empty strings	# remove dot or colon, and ignore empty strings
79	81	# a bit difficult to tell what is hz and what khz	# a bit difficult to tell what is hz and what khz

97	99	{_source_string: .NC},	{_source_string: .NC},
98	100	(	(
99	101	.NC \|	.NC \|
100		capture("^(?<recorded>\\d)(?<multiplexed>\\d)(?<side>[A-L]$)") \|
	102		capture("^(?<r>\\d)(?<m>\\d)(?<s>[A-L]$)") \|
101	103	{	{
102		recorded: .recorded \| tonumber,
103		multiplexed: .multiplexed \| tonumber,
104		side: .side
	104		recorded: .r \| tonumber,
	105		multiplexed: .m \| tonumber,
	106		side: .s
105	107	}	}
106	108	)	)
107	109	] \| add	] \| add
108		}
109		}
110
	110		}
	111		}

Hints:
Before first commit, do not forget to setup your git environment:

git config --global user.name "your_name_here"
git config --global user.email "your@email_here"

Clone this repository using HTTP(S):

git clone https://rocketgit.com/user/dleucas/wmmsdb

Clone this repository using ssh (do not forget to upload a key first):

git clone ssh://rocketgit@ssh.rocketgit.com/user/dleucas/wmmsdb

Clone this repository using git:

git clone git://git.rocketgit.com/user/dleucas/wmmsdb

You are allowed to anonymously push to this repository.
This means that your pushed commits will automatically be transformed into a merge request:

... clone the repository ...
... make some changes and some commits ...
git push origin main