This lesson is in the early stages of development (Alpha version)

PyMARC Basics

Introduction and Setup

Overview

Teaching: 10 min
Exercises: 15 min
Questions
Objectives
  • Install and check Python v3.x

  • Install and check pymarc

  • Collect data files used in lesson

  • Basic preflight checks

Episode 1: Getting Started

Hello World!

Start a new python file in your IDE of choice.

We’re using SublimeText, so we just need to save a new file as test.py in SublimeText.

We can get Python to tell us where it is installed, which also reveals its version, with the following code:

import sys
print(sys.executable)

To ‘deploy’ the script in Sublime, you can use the following keystrokes:

Control + S to save your changes

Control + B to “build” the script and run it.

It’s worth remembering these two commands; you’ll use them a lot in this lesson!

We’re looking to see where Python is installed. There’s a version number in the file path: we want to see Python3x (for example Python38) to make sure we’re all using Python version 3.

C:\Users\gattusoj\AppData\Local\Programs\Python\Python38\python.exe
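The install path only hints at the version. As a complementary check, here is a minimal sketch that asks the interpreter directly, using the standard library’s sys.version_info. This is an addition for illustration and not part of the original lesson script.

```python
import sys

# sys.version_info holds the interpreter version, e.g. major=3, minor=8.
major, minor = sys.version_info.major, sys.version_info.minor
print("Python {}.{}".format(major, minor))

# True when the interpreter is any Python 3 release.
is_python3 = major == 3
print("Using Python 3:", is_python3)
```

If the last line prints False, the script is being run by Python 2 and the rest of the lesson is unlikely to work as written.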

Next we need to install pymarc.

Open command line by hitting the Windows key, type cmd and hit Enter
Open a terminal if you’re on Linux (Ctrl+Alt+T on many desktops)

To install packages, we use a tool called ‘Pip Installs Packages’, better known as pip. To install pymarc, type pip3 install pymarc at the cursor and hit Enter. You need to be connected to the public internet to use pip. If it worked, you should see something like: Successfully installed pymarc-3.2.0
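If you want to confirm the install from inside Python rather than from the pip output, the following sketch may help. It assumes Python 3.8 or later, where the standard library module importlib.metadata is available. The helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as 3.2.0 if pymarc is installed, otherwise None.
print(installed_version("pymarc"))
```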

Make a folder to hold all the files we’ll use in this lesson called pymarc_basics somewhere that you can easily find.

To test it worked, open your IDE (SublimeText) and make a new python file (e.g. test.py) and type:

import pymarc

print ("{}".format("Hello World!"))

Run the script. If there are no error messages in the Python terminal window and you see the text Hello World!, you’ve successfully installed pymarc!

We’re using a slightly overcomplicated way of printing our “Hello World!” string in this test. There is a subtle but important difference between some versions of Python in how they handle text. By using the format() method we can double check that we’re using a version of Python that will work OK with the rest of the lesson.

We need to make sure we have a local copy of the MARC file we’ll use for the rest of the lesson. You can find all the data files, and helper scripts in the setup folder: http://bit.ly/2PItN0Y

At minimum, download the NLNZ_example_marc.marc file. There are other .marc files in that location. These just contain more records. Download them if you want to have more MARC records to explore.

Save any MARC files into the same folder as your scripts.

You should have a folder structure that looks like this:

pymarc_basics
	episode_1.py
	test.py
	NLNZ_example_marc.marc
	...
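As an optional preflight check, this sketch reports which expected files are missing from a folder. It uses only the standard library. The function name missing_files is hypothetical, and the example call assumes the script is run from inside the pymarc_basics folder.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the expected file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]


# An empty list means the data file is where the later scripts expect it.
print(missing_files(".", ["NLNZ_example_marc.marc"]))
```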

Hello PyMARC!

Finally, let’s put all this together, and see if we can read the MARC file.

Start a new python file, episode_1.py and type the following code:

import pymarc

my_marc_file = "NLNZ_example_marc.marc"

with open(my_marc_file, 'rb') as data:
	reader = pymarc.MARCReader(data)
	for record in reader:
		print (record.title())

We’ll cover this script in the next episode. For now, you should see a list of the titles included in our test set of MARC records:

Larger than life : the story of Eric Baume /
Difficult country : an informal history of Murchison /
These fortunate isles : some Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /
As high as the hills : the centennial history of Picton /
Revised establishment plan.
A Statistical profile of young women /
Chilton Saint James : a celebration of 75 years, 1918-1993 /
Tatum, a celebration : 50 years of Tatum Park /
The Discussion document of the Working Party on Fire Service Levy / ... prepared by the Working Party on the Fire Service Levy.
1991 New Zealand census of population and dwellings.

Key Points

  • The software we need for the lesson is the correct version and works as expected

  • The files we need for rest of lesson are available


MARC basics

Overview

Teaching: 10 min
Exercises: 15 min
Questions
Objectives
  • What is MARC?

  • Where are helpful resources for MARC?

  • Understanding the basic structure of a MARC file in pymarc

Episode 2: MARC basics

What is MARC?

“MARC is the acronym for MAchine-Readable Cataloging. It defines a data format that emerged from a Library of Congress-led initiative that began nearly forty years ago. It provides the mechanism by which computers exchange, use, and interpret bibliographic information, and its data elements make up the foundation of most library catalogs used today. MARC became USMARC in the 1980s and MARC 21 in the late 1990s.” https://www.loc.gov/marc/

“MARC Terms and Their Definitions” https://www.loc.gov/marc/umb/um01to06.html#part3

“This online publication provides access to both the full and concise versions of the MARC 21 Format for Bibliographic Data.” http://www.loc.gov/marc/bibliographic/

Reading MARC files with PyMARC

Let’s take a look at a MARC record, and see what it contains.

Start a new python file episode_2.py and type the following script:

import pymarc

my_marc_file = "NLNZ_example_marc.marc"

reader = pymarc.MARCReader(open(my_marc_file, 'rb'), force_utf8=True)

for record in reader:
	print (record)
	quit()

You should see a marc record, line-by-line, in the terminal window:

=LDR  00912cam a2200301 a 4500
=001  9962783502836
=003  Nz
=005  20161223124839.0
=008  731001s1967\\\\at\ac\\\\\\\\\00010beng\d
=035  \\$z4260
=035  \\$a(nzNZBN)687856
=035  \\$9   67095940
=035  \\$a(Nz)3760235
=035  \\$a(NLNZils)6278
=035  \\$a(NLNZils)6278-ilsdb
=035  \\$a(OCoLC)957343
=040  \\$dWN*
=042  \\$anznb
=050  0\$aPN5596.B3$bM3
=082  0\$a823.2$220
=100  1\$aManning, Arthur,$d1918-
=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
=260  \\$aSydney [N.S.W.], :$aWellington [N.Z.] :$bReed,$c1967.
=300  \\$a184 p., [8] p. of plates :$bill., ports. ;$c23 cm.
=500  \\$aEric Baume was a New Zealander.
=600  10$aBaume, Eric,$d1900-1967.
=650  \0$aJournalists$zAustralia$xBiography.
=650  \0$aAuthors, New Zealand$y20th century$xBiography. 

Understanding MARC field in PyMARC

It’s useful to understand the structure of the data object we can see, and how it relates to the MARC format.

Notice the first piece of data starts with an =. The first one we see is =LDR. The = tells us it’s a field label, or tag, and the following 3 characters are the tag value. In this case LDR tells us we’re reading the leader field. All other MARC field tags are zero-padded 3 digit numbers. The data that follows on that row is the field value. In MARC, the fields with tags 001 to 009 are called control fields, and are structured a little differently to all other fields. It’s worth referring to the MARC guidelines to make sure we can correctly understand what the data is telling us… http://www.loc.gov/marc/bibliographic/bd00x.html

We’ll get into the details of some of the control fields later on. For now let’s focus on one of the “standard” fields and take a look at how it’s made up. Let’s look at the 245 field, the “title statement” http://www.loc.gov/marc/bibliographic/bd245.html

=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.

We’ve established the =245 is telling us the field name. We can see a couple of spaces, and then the digits 1 and 0. These are the two ‘indicators’ for the field. Sometimes an indicator is blank or undefined, in which case it appears in this output as a backslash (\) character, as in the =035 rows above. Sometimes it is set to a value, like we can see here. We’ll look at how to parse this information in a moment.

Following the two indicators, we can see some data that’s separated by the $ character:

$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.

The $ is always followed by another character, the subfield label, and then the value of the subfield. In this instance we can see 3 subfields; a, b, and c:

a| Larger than life :
b| the story of Eric Baume /
c| by Arthur Manning.
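To make the layout concrete, here is a rough sketch that pulls the same parts out of the printed line using plain string slicing. It does not use pymarc, and the function name split_field_line is invented for this example. It assumes the exact text layout shown above and that no subfield value contains a $ character, so treat it as an illustration rather than a real MARC parser.

```python
line = "=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning."


def split_field_line(text):
    """Split one printed data field line into tag, indicators and subfields."""
    tag = text[1:4]           # the three characters after the "="
    indicator1 = text[6]      # first indicator, after two spaces
    indicator2 = text[7]      # second indicator
    subfields = []
    # Everything after the indicators is "$" + code + value, repeated.
    for chunk in text[8:].split("$")[1:]:
        subfields.append((chunk[0], chunk[1:]))
    return tag, indicator1, indicator2, subfields


tag, ind1, ind2, subfields = split_field_line(line)
print(tag, ind1, ind2)
for code, value in subfields:
    print(code, "|", value)
```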

Spend a moment looking at the MARC specs for this field, and the data we have in this record.

http://www.loc.gov/marc/bibliographic/bd245.html

Using the MARC standard to “read” a field - indicators

What do the two indicators tell us about the 245 field?

Solution

Indicator 1 is set to the value 1

Looking at the MARC standard for this field, we can see that this has the meaning “1 - Added entry”

Indicator 2 is set to the value 0

Looking at the MARC standard for this field, we can see that this has the meaning “0 - No nonfiling characters”

Using the MARC standard to “read” a field - subfields

What do the three subfields tell us about the 245 field?

Solution

We can see the subfields ‘a’, ‘b’, and ‘c’

Referring to the MARC standard we can see:

“$a - Title”

“Title proper and alternative title, excluding the designation of the number or name of a part.”

“$b - Remainder of title”

“Data includes parallel titles, titles subsequent to the first (in items lacking a collective title), and other title information.”

“$c - Statement of responsibility, etc.”

“First statement of responsibility and/or remaining data in the field that has not been subfielded by one of the other subfield codes.”

Key Points

  • We know where to find resources that describe MARC records/fields

  • We know how to identify the component parts of a MARC record/field


Parsing with PyMARC - Part One

Overview

Teaching: 30 min
Exercises: 30 min
Questions
Objectives
  • How to read a marc record with pymarc

  • Finding specific fields with pymarc

Episode 3: Parsing with pymarc

Finding specific fields with PyMARC

Start a new file in your IDE episode_3.py

Set up the basic record reader like we did in episode 2:

from pymarc import MARCReader

my_marc_file = "NLNZ_example_marc.marc"

with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        print (record)

We can use the record object created by pymarc to only process fields we’re interested in. We can do that by telling Python the field tag we’re interested in, e.g. print (record['245']). In this piece of code we’re asking pymarc for the first field in our record object that has the tag 245, and printing it.

If we add this piece of code to our basic file parser we can see all the title statements for our test set:

for record in reader:
	print (record['245'])

=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
=245  10$aDifficult country :$ban informal history of Murchison /$cMargaret C. Brown.
=245  00$aThese fortunate isles :$bsome Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /$cselected, translated and annotated by John Goodliffe.
=245  10$aAs high as the hills :$bthe centennial history of Picton /$cby Henry D. Kelly.
=245  10$aRevised establishment plan.
=245  02$aA Statistical profile of young women /$ccompiled for the YWCA of Aotearoa-New Zealand by Nic Mason, Dianna Morris and Angie Cairncross ; with the assistance of Shell New Zealand.
=245  10$aChilton Saint James :$ba celebration of 75 years, 1918-1993 /$cJocelyn Kerslake.
=245  10$aTatum, a celebration :$b50 years of Tatum Park /$cby Tom Howarth.
=245  04$aThe Discussion document of the Working Party on Fire Service Levy /$b... prepared by the Working Party on the Fire Service Levy.
=245  00$a1991 New Zealand census of population and dwellings.$pNew Zealanders at home.

Understanding Python data types and data objects

What data type does the record object appear to be?

Hint: print (type(record))

Solution

The record object that PyMARC creates looks like it’s an instance of the Python data type called a dictionary.

Getting used to what data types look like, and knowing how to ‘ask Python’ what they actually are, are important skills to develop.

We can ask python to tell us what the data type is of any object using the type() function. Doing this reveals that the record is a class object - <class 'pymarc.record.Record'>.

Once we’ve spent a little time around python code, we might guess that the record item is a particular data type called a dictionary or dict(). The main clue we might rely on is the square brackets after the object name. e.g my_dict['my_dict_key']

We’re not going to spend serious time in this lesson exploring either class or dict structures. If you’re interested to learn more about them there are many free resources that can help, like https://www.tutorialspoint.com/python/python_data_structure.htm

Behind the scenes when this script is run, python looks at the data object that pymarc created, and looks for the part of it that has the label, or ‘key’ of “245”.

See what happens if you give it a key that isn’t included in the data object:

for record in reader:
	print (record['this_key_doesnt_exist'])
	quit()

None

Notice it doesn’t give you an error or otherwise strongly signal that it did not find the key you’re interested in. It’s useful to know this happens if we use a key that doesn’t exist. (This is the behaviour of the pymarc 3.x release installed in this lesson; other releases may raise a KeyError instead.) If you see None in your scripts where you are expecting an actual value, double check the key you’ve used. Common errors would be typos (e.g. record['254'] instead of record['245']) or using a number instead of a string (e.g. record[245] instead of record['245']).

In this case, pymarc returns None, which itself is an important concept in Python. It’s worth pausing for a moment and making sure we understand what None means in the Python context.

“The None keyword is used to define a null value, or no value at all.

None is not the same as 0, False, or an empty string.

None is a datatype of its own (NoneType) and only None can be None.”

https://www.w3schools.com/python/ref_keyword_none.asp
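A short sketch of how None behaves, using a plain dictionary as a stand-in for a record (this is not a pymarc object, and fake_record and describe are names invented for the example):

```python
# A plain dictionary standing in for a record; this is not a pymarc object.
fake_record = {"245": "Larger than life :"}


def describe(value):
    """Report whether a looked-up value is missing (None) or present."""
    if value is None:
        return "nothing found"
    return "found: {}".format(value)


print(describe(fake_record.get("245")))  # the key exists
print(describe(fake_record.get("254")))  # typo in the key, so .get() returns None

# None is distinct from other "empty-looking" values.
print(None == 0, None == False, None == "")
```

Testing with "is None" rather than a plain truth test matters here: an empty string is a real (if empty) value, while None means there was no value at all.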

Accessing Subfields

We can use the same ‘key’ method to get to any subfields we might have:

for record in reader:
	print ("Subfield 'a':", record['245']['a'])
	print ("Subfield 'b':", record['245']['b'])
	print ("Subfield 'c':", record['245']['c'])
	quit()

Notice how we’re asking for 3 things in each of these print statements. We’re asking Python to look in the data object called record. We’re asking for the part of that item that has the key '245'. Within that subset of the record object, we’re further asking for the part that has the key 'a', 'b' or 'c'.

Subfield 'a': Larger than life :
Subfield 'b': the story of Eric Baume /
Subfield 'c': by Arthur Manning.

Nuance of punctuation use in MARC

What do you notice about these pieces of data?

Solution

The text we see is three individual bits of text. They are not particularly well connected to each other in this form - we need to process all of the 245 field as a whole to make sure we’re operating on the data we’re expecting.

Notice the punctuation that is included in the text. We might need to do something to clean that up if we’re going to further process the data we find in this field. The use of punctuation marks within a piece of information to indicate specific parts within the data item is not limited to the 245 field. We can see it variously throughout the MARC record.

“In the past, catalogers recorded bibliographic data as blocks of text and used punctuation to demarcate and provide context for the various data elements. This made data understandable to users when it was presented on cards and in online catalogs.” https://www.loc.gov/aba/pcc/documents/PCC-Guidelines-Minimally-Punctuated-MARC-Data.docx

This is one of the interesting challenges we face when processing MARC records in bulk/computationally. These punctuation marks have a strong historical place in cataloging practice, and as a result, they’re an artifact we need to be aware of and deal with computationally.

What might we need to consider if we want to use data that potentially spans multiple subfields?

Solution

Suppose we want to match the title in a MARC record against a list of book titles we’ve been given to compare. We might find ourselves with a problem.

How do we know which of the subfields hold the right amount of data to search for a match? This isn’t a new problem for libraries - it’s another facet of any deduping process! For us working in the MARC record via Python, we have to be aware of the data we’re using, both inside the MARC record and in the data we’re trying to match against. We might need to join two or more subfields. We might need to remove the cataloger’s punctuation to help with clean matching.
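One possible way to prepare subfield text for matching is sketched below. It works on plain strings copied from the output above, not on pymarc objects. The function names, the set of characters stripped, and the choice to lower-case the result are all assumptions for illustration. Real data may need different rules, for example where a title legitimately ends in one of these characters.

```python
def strip_trailing_punctuation(text):
    """Remove trailing spaces and the separator characters : / ; , ="""
    return text.rstrip(" :/;,=")


def normalise_title(parts):
    """Clean each part, join them with single spaces, and lower-case the result."""
    cleaned = [strip_trailing_punctuation(part) for part in parts]
    return " ".join(cleaned).lower()


subfield_values = ["Larger than life :", "the story of Eric Baume /"]
print(normalise_title(subfield_values))
```

Applying the same normalisation to both the MARC data and the list being compared gives the two sides a better chance of matching.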

There are a few other ways we can get to the same data that are worth exploring.

The field object that pymarc creates has a method called “value()”. We can use this to return the data field as a text string without any subfield markers or indicators.

This is also a good opportunity to explore the “type()” function.

PyMARC gives us some convenient methods to help us access key bits of data. One of those is the method .title().

Let’s have a look at one record and what these various methods produce:

for record in reader:
    print (type(record))
    print ()
    print (record['245'])
    print (type(record['245']))
    print ()
    print (record['245'].value())
    print (type(record['245'].value()))
    print ()
    print (record['245']['a'])
    print (type(record['245']['a']))
    print ()
    print (record.title())
    print (type(record.title()))
    quit()

<class 'pymarc.record.Record'>

=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
<class 'pymarc.field.Field'>

Larger than life : the story of Eric Baume / by Arthur Manning.
<class 'str'>

Larger than life :
<class 'str'>

Larger than life : the story of Eric Baume /
<class 'str'>

We can use a similar approach to find a particular record. Let’s say we’re looking for the record with the 001 identifier of 99628153502836. We can loop through the records in our reader object and look for a match:

	for record in reader:
		if record['001'].value() == str(99628153502836):
			print ("Success! found record with id 99628153502836")

We can make this a little more useful by abstracting the search id into a variable:

	search_id = 99628153502836
	for record in reader:
		if record['001'].value() == str(search_id):
			print ("Success! found record with id {}".format(search_id))

Using variables in code

What is main difference between these two code snippets? Why is it useful?

Solution

In the first example we’ve hard coded the text we want to search for, both in the search itself and in the successful response. In the second example we’ve used a variable to hold the identifier. This gets reused in both the search and the successful response.

By doing this we’ve made the code much more useful. We can reuse the same code and answer more questions just by changing one variable. This is a good basic process to use when you’re writing your code. Untangling hard coded values can be a very time consuming and confusing process.

More on data types…

Why do we need the str() conversion? Can you think of a different way of solving the same problem?

Solution

In Python the string "1234" is not the same as the integer 1234. When trying to match data with Python we have to be careful to make sure we are comparing like with like, e.g. strings with strings, or integers with integers. If we don’t, we will miss matches.

We didn’t have to convert the search term search_id into a string. We equally could have turned the record ID data into an integer: if int(record['001'].value()) == search_id

There are pros and cons for each approach. The main thing we should be aware of is that numbers and strings are different things, and numbers as strings is a particularly common problem to encounter.
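The difference can be seen with plain values, without reading any MARC file. The identifier is the one used above. The padded_id value is a made-up example to show one of the trade-offs between the two approaches.

```python
record_id = "99628153502836"   # what a field value looks like: a string
search_id = 99628153502836     # the search term as an integer

print(record_id == search_id)        # False: a string never equals an integer
print(record_id == str(search_id))   # True: compare string with string
print(int(record_id) == search_id)   # True: compare integer with integer

# A made-up identifier with leading zeros shows one trade-off.
padded_id = "0042"
print(int(padded_id) == 42)          # True: converting to int ignores the zeros
print(padded_id == str(42))          # False: converting to str cannot restore them
```

Converting to int also fails with a ValueError if an identifier contains letters, so comparing strings is often the safer default.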

Let’s make our search a little more interesting. Let’s say we’re interested in any record that contains the string “New Zealand” in the 245 field:

    if "New Zealand" in record['245'].value():
        print (record['245'])

	=245  00$aThese fortunate isles :$bsome Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /$cselected, translated and annotated by John Goodliffe.
	=245  02$aA Statistical profile of young women /$ccompiled for the YWCA of Aotearoa-New Zealand by Nic Mason, Dianna Morris and Angie Cairncross ; with the assistance of Shell New Zealand.
	=245  00$a1991 New Zealand census of population and dwellings.$pNew Zealanders at home.

Let’s have a look at what happens if we use the .title() method as the string that we’re searching:

for record in reader:
    if "New Zealand" in record.title():
        print (record['245'])
        print (record.title())
        print ()

=245  00$aThese fortunate isles :$bsome Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /$cselected, translated and annotated by John Goodliffe.
These fortunate isles : some Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /

=245  00$a1991 New Zealand census of population and dwellings.$pNew Zealanders at home.
1991 New Zealand census of population and dwellings.

Notice the difference between the output of these two scripts: the record whose only match is in subfield c no longer appears. This is a really good example of a paradigm we find in coding, explicit vs implicit.

In programming, implicit is often used to refer to something that’s done for you by other code behind the scenes. Explicit is the manual approach to accomplishing the change you wish to have by writing out the instructions to be done explicitly.

https://blog.codeship.com/what-is-the-difference-between-implicit-vs-explicit-programming/

In this case, the programmers behind the pymarc library have made some choices about which version of a title they think is the most straightforward. This is based on what the MARC standard says about a title, and how to interpret the field as a human being. We don’t necessarily know what those choices were; we simply trust that they made sensible ones! This is an example of implicit coding. If we built our own title parser that read the field and its associated subfields, and used some logic to turn that data into a single string like record.title() does, we would be explicitly coding this solution.
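As an illustration of the explicit approach, here is a small sketch of a hand-written title builder. It works on (code, value) pairs typed out from the 245 field shown earlier, not on a pymarc object. The rule it applies (join subfields a and b with a space) is an assumption chosen because it reproduces the title output seen above for this record. It is not a copy of how pymarc implements title().

```python
def build_title(subfields):
    """Join the values of subfields a and b, in order, with single spaces."""
    parts = [value for code, value in subfields if code in ("a", "b")]
    return " ".join(parts)


# The subfields of the first record's 245 field, typed out by hand.
my_245_subfields = [
    ("a", "Larger than life :"),
    ("b", "the story of Eric Baume /"),
    ("c", "by Arthur Manning."),
]

print(build_title(my_245_subfields))
```

Because the rule is written out, we can see and change exactly which subfields are included, which is the trade-off between explicit and implicit code.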

Key Points

  • We can print out any field from a MARC record we are interested in.

  • We can search for specific information in a set of MARC records.


Coffee Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


Parsing with PyMARC - Part Two

Overview

Teaching: 30 min
Exercises: 30 min
Questions
Objectives
  • Finding specific fields with pymarc

  • Finding specific data with pymarc

Episode 3: Parsing with pymarc

Finding specific fields with PyMARC

Carry on with the script file from the previous episode, episode_3.py

Let’s keep looking at searching.

What about if we want any record that contains the string “New Zealand”? How might we adapt the code?

	if "New Zealand" in str(record):
		print (record) 
		print()

This could result in an overwhelming amount of data if we’re not careful. Let’s have a think about how we might frame that question in a way that helps to refine the data into a useful form.

What happens if we ask to see any MARC field that contains our search string? We can do this by adding another loop. We also might want to think about how we associate each match with a particular record. One way of doing this would be to print the record ID as well as the matching field:

    if "New Zealand" in str(record):
        for field in record:
            if "New Zealand" in field.value():
                print (record['001'].value(), field)  
        print ()

	9962783502836 =500  \\$aEric Baume was a New Zealander.
	9962783502836 =650  \0$aAuthors, New Zealand$y20th century$xBiography.

	99627923502836 =245  00$aThese fortunate isles :$bsome Russian perceptions of New Zealand in the nineteenth and early twentieth centuries /$cselected, translated and annotated by John Goodliffe.
	99627923502836 =651  \0$aNew Zealand$xForeign public opinion, Russian$xHistory.

	99628063502836 =650  \0$aElectric power distribution$zNew Zealand$zBay of Plenty (Region)$xDeregulation.
	99628063502836 =650  \0$aElectric utilities$zNew Zealand$zBay of Plenty (Region)$xDeregulation.
	99628063502836 =650  \0$aPrivatization$zNew Zealand$zBay of Plenty (Region)

	99628093502836 =245  02$aA Statistical profile of young women /$ccompiled for the YWCA of Aotearoa-New Zealand by Nic Mason, Dianna Morris and Angie Cairncross ; with the assistance of Shell New Zealand.
	99628093502836 =650  \0$aYoung women$zNew Zealand$xStatistics.
	99628093502836 =710  20$aYWCA of Aotearoa New Zealand.
	99628093502836 =710  20$aShell Group of Companies in New Zealand.

	99628113502836 =650  \0$aChurch schools$zNew Zealand$zLower Hutt$xHistory.
	99628113502836 =650  \0$aSingle-sex schools$zNew Zealand$zLower Hutt$xHistory.

	99628123502836 =260  0\$a[Wellington, N.Z.] :$bScout Association of New Zealand,$c1992.
	99628123502836 =650  \0$aBoy Scouts$zNew Zealand$xHistory.
	99628123502836 =710  20$aScout Association of New Zealand.

	99628153502836 =610  20$aNew Zealand Fire Service Commission$xFinance.
	99628153502836 =650  \0$aFire departments$zNew Zealand$xFinance.
	99628153502836 =710  10$aNew Zealand.$bDepartment of Internal Affairs.
	99628153502836 =710  10$aNew Zealand.$bWorking Party on the Fire Service Levy.

	99628163502836 =245  00$a1991 New Zealand census of population and dwellings.$pNew Zealanders at home.
	99628163502836 =500  \\$a"New Zealanders' living arrangements, including their household composition and family type characteristics are the focus of this report. Information on private dwellings is also included"--Back cover.
	99628163502836 =500  \\$aCover title: 1991 census of population and dwellings : New Zealanders at home.
	99628163502836 =500  \\$aSpine title: Census 1991 : New Zealanders at home.
	99628163502836 =650  \0$aHouseholds$zNew Zealand$xStatistics.
	99628163502836 =650  \0$aFamilies$zNew Zealand$xStatistics.
	99628163502836 =650  \0$aCost and standard of living$zNew Zealand$xStatistics.
	99628163502836 =651  \0$aNew Zealand$xCensus, 1991.
	99628163502836 =651  \0$aNew Zealand$xPopulation$xStatistics.
	99628163502836 =710  10$aNew Zealand.$bDepartment of Statistics.
	99628163502836 =740  01$a1991 census of population and dwellings, New Zealanders at home.
	99628163502836 =740  01$aCensus 1991, New Zealanders at home.
	99628163502836 =740  01$aNew Zealanders at home.
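Printing matches is fine for a quick look. If we want to reuse them, one option is to collect them by record ID. The sketch below uses a few (record ID, tag) pairs typed out from the output above instead of reading the MARC file. The variable names are invented for the example.

```python
from collections import defaultdict

# (record ID, tag of the matching field) pairs copied from the output above.
matches = [
    ("9962783502836", "500"),
    ("9962783502836", "650"),
    ("99627923502836", "245"),
    ("99627923502836", "651"),
]

# Group the matching tags under the record they came from.
tags_by_record = defaultdict(list)
for record_id, tag in matches:
    tags_by_record[record_id].append(tag)

for record_id, tags in tags_by_record.items():
    print(record_id, tags)
```

In the real loop, the append would sit where the print statement currently is, using record['001'].value() and field.tag as the pair.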

We can use the same loop/iterator approach to process any field that has subfields. Let’s see what that looks like:

	for record in reader:
		for field in record:
			for subfield in field:
				print (subfield)
		quit()

('z', '4260')
('a', '(nzNZBN)687856')
('9', '   67095940')
('a', '(Nz)3760235')
('a', '(NLNZils)6278')
('a', '(NLNZils)6278-ilsdb')
('a', '(OCoLC)957343')
('d', 'WN*')
('a', 'nznb')
('a', 'PN5596.B3')
...

For brevity we’ve only shown the first few subfields.
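Each item printed above is a pair: a subfield code and its value. A short sketch of working with pairs like these, using the first five copied from the output as plain tuples (no pymarc involved):

```python
# The first five (code, value) pairs from the output above.
subfields = [
    ("z", "4260"),
    ("a", "(nzNZBN)687856"),
    ("9", "   67095940"),
    ("a", "(Nz)3760235"),
    ("a", "(NLNZils)6278"),
]

# Unpack each pair into two names, and keep only the values from subfield a.
a_values = []
for code, value in subfields:
    if code == "a":
        a_values.append(value)

print(a_values)
```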

Magic PyMARC methods…

Pymarc helps us out by providing a few useful methods that allow us to get fields and subfields much more elegantly.

Firstly, get_fields():

for record in reader:
	my_500s = record.get_fields('500')
	for my_500 in my_500s:
		print (my_500)

Before you consider the output of this code, consider the structure of the code itself.

Coding conventions

Can you see any coding conventions being used? What are they? Why are they being used?

Solution

In coding there are some strong rules about things we can/can’t do. These are important because if we don’t follow these rules our code doesn’t work. This includes things like not starting variable names with numbers or trying to use reserved symbols like = in ways that the interpreter can’t process.

There are also coding conventions. These are less strict rules, but equally important. These are usually followed to aid readability, portability and shareability of our code. Pep 8 is a useful start for learning about common conventions https://www.python.org/dev/peps/pep-0008/

Conventions can also be quite personal, or agreed by a coding group.

In this script when we’re making ‘iterables’ (things that contain other things, like lists or dictionaries) we always pluralise the variable name e.g. for thing in things. This helps us to know the data type we are using.

In this script we’re adding the text “my_” to any item we’re creating. It’s a convention that’s used to show that the item being made/used “belongs” to a particular method. It’s not something we need to pay particular attention to, but it is useful to notice it in code as we start to explore Python scripts.


=500  \\$aEric Baume was a New Zealander.
=500  \\$aIncludes index.
=500  \\$aAvailable from Fiesta Products, Christchurch, N.Z.
=500  \\$aIncludes index.
=500  \\$aCover title.
=500  \\$a"December 1992."
=500  \\$aCover title.
=500  \\$aSpine title: Chilton Saint James : 75 years.
=500  \\$aAvailable from Chilton Saint James School, P.O. Box 30090, Lower Hutt, N.Z.
=500  \\$aSpine title: 50 years of Tatum Park.
=500  \\$aCaption title.
=500  \\$a"March 1993"--Cover.
=500  \\$aCover title: Discussion document by the Working Party on Fire Service Levy.
=500  \\$a"Chaired by ... David Harrison."
=500  \\$aChiefly statistical tables.
=500  \\$a"New Zealanders' living arrangements, including their household composition and family type characteristics are the focus of this report. Information on private dwellings is also included"--Back cover.
=500  \\$aCover title: 1991 census of population and dwellings : New Zealanders at home.
=500  \\$aSpine title: Census 1991 : New Zealanders at home.
=500  \\$aOn cover: Census '91.
=500  \\$aInvalid ISSN on t.p. verso.
=500  \\$aIncludes advertising.
=500  \\$a"Catalogue number 02.226.0091"--T.p. verso.

We’re not limited to one field in this method:

for record in reader:
	my_fields = record.get_fields('500', '700')
	for my_field in my_fields:
		print (my_field)

=500  \\$aEric Baume was a New Zealander.
=500  \\$aIncludes index.
=500  \\$aAvailable from Fiesta Products, Christchurch, N.Z.
=700  10$aGoodliffe, J. D.$d(John Derek)
=500  \\$aIncludes index.
=500  \\$aCover title.
...

Let’s unpick a field that contains subfields:

for record in reader:
    my_245s = record.get_fields('245')
    for my_245 in my_245s:
        my_245_subfields = my_245.get_subfields('a', 'b', 'c', 'f', 'g', 'h', 'k', 'n', 'p', 's', '6', '8')
        for my_245_subfield in my_245_subfields:
            print (my_245_subfield)
    quit()


Larger than life :
the story of Eric Baume /
by Arthur Manning. 

We need to specify the field codes, or subfield codes for both of these methods. The list of subfield codes was made by referring to the MARC data page for the 245 field https://www.loc.gov/marc/bibliographic/concise/bd245.html

There’s one more trick to pymarc that helps us to parse a MARC record. In a previous episode we looked at the structure of MARC files, and noted where we can see the two indicators for any field. We can get to this data directly via pymarc:

for record in reader:
	print (record['245'])
	print ("Field 245 indicator 1: {}".format(record['245'].indicator1))
	print ("Field 245 indicator 2: {}".format(record['245'].indicator2))
	quit()

=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
Field 245 indicator 1: 1
Field 245 indicator 2: 0

We can use this as another searching tool:

for record in reader:
	ind_2 = record['245'].indicator2
	if ind_2 != '0':
		print (record['245'])
		print ()

=245  02$aA Statistical profile of young women /$ccompiled for the YWCA of Aotearoa-New Zealand by Nic Mason, Dianna Morris and Angie Cairncross ; with the assistance of Shell New Zealand.

=245  04$aThe Discussion document of the Working Party on Fire Service Levy /$b... prepared by the Working Party on the Fire Service Levy.
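In the 245 field, the second indicator records the number of nonfiling characters, that is, how many leading characters (such as “A ” or “The ”) to skip when sorting. The sketch below applies that to the two titles above using plain strings. The function name filing_title is invented, and treating a blank or non-digit indicator as zero is an assumption made for this example.

```python
def filing_title(title, indicator2):
    """Drop the number of leading characters given by the second indicator."""
    skip = int(indicator2) if indicator2.isdigit() else 0
    return title[skip:]


print(filing_title("A Statistical profile of young women /", "2"))
print(filing_title("The Discussion document of the Working Party on Fire Service Levy /", "4"))
print(filing_title("Larger than life :", "0"))
```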

List of PyMARC methods associated with a record object

The pymarc record object has more useful data “shortcuts” that we won’t go into the specifics of, but are useful to know. These are all pre-built methods that get you to key data in a record, assuming (a) it’s been entered in the record and (b) it’s been entered in a way the pymarc developers expected to find it.

record.author()
record.isbn()
record.issn()
record.issn_title()
record.leader
record.location()
record.pos
record.publisher()
record.pubyear()
record.series()
record.sudoc()
record.title()
record.uniformtitle()
record.notes()
record.subjects()
record.physicaldescription()

If you have time, see what happens when you use these “shortcuts” to access bits of data in the pymarc record object.

Try printing the whole record and matching each data item you see with the original record:

for record in reader:
	print (record)
	print ("\n______________________________\n\n")
	print (record.author())
	print (record.isbn())
	print (record.issn())

	quit()

Building a basic parser!

We can arrange some of the approaches we’ve looked at into a single script:

    for record in reader:
        print ("MMS ID:", record['001'].value())
        for my_field in record:
            #### Control fields (in the range 00x) don't have indicators. 
            #### We use this try/except catcher to allow us to elegantly handle both cases without encountering a breaking error, or a coding failure
            try:  
                ind_1 = my_field.indicator1
                ind_2 = my_field.indicator2

                #### Setting an empty indicator to a more conventional and readable "/"
                if ind_1 == " ":
                    ind_1 = "/"
                if ind_2 == " ":
                    ind_2 = "/"

                print ("\tTag #:", my_field.tag, "Indicator 1:", ind_1 , "Indicator 2:", ind_2)
            except AttributeError:
                print ("\tTag #:", my_field.tag)

            for my_subfield_key, my_subfield_value in my_field:
                print ("\t\t", my_subfield_key, my_subfield_value)
            print ()
        print ()
        quit()

MMS ID: 9962783502836
	Tag #: 001

	Tag #: 003

	Tag #: 005

	Tag #: 008

	Tag #: 035 Indicator 1: / Indicator 2: /
		 z 4260

	Tag #: 035 Indicator 1: / Indicator 2: /
		 a (nzNZBN)687856

	Tag #: 035 Indicator 1: / Indicator 2: /
		 9    67095940

	Tag #: 035 Indicator 1: / Indicator 2: /
		 a (Nz)3760235

	Tag #: 035 Indicator 1: / Indicator 2: /
		 a (NLNZils)6278

	Tag #: 035 Indicator 1: / Indicator 2: /
		 a (NLNZils)6278-ilsdb

	Tag #: 035 Indicator 1: / Indicator 2: /
		 a (OCoLC)957343

	Tag #: 040 Indicator 1: / Indicator 2: /
		 d WN*

	Tag #: 042 Indicator 1: / Indicator 2: /
		 a nznb

	Tag #: 050 Indicator 1: 0 Indicator 2: /
		 a PN5596.B3
		 b M3
	...

Have a play around with building a parser that displays data that you’re interested in, using the above as a template.

Experiments!

What is the 001 identifier for the record with the OCLC identifier 39818086?

How many records have more than one 500 field?

How many records describe an item with English as the primary language? (hint: We can look in the record['008'] field to find out the primary language)

Solution

(a) 99628093502836

for record in reader:
	for f in record.get_fields('035'):
		if "39818086" in f.value():
			print (record['001'].value())

(b) 4

count = 0
for record in reader:
    my_500s = record.get_fields('500')
    # only count records that have more than one 500 field
    if len(my_500s) > 1:
        count += 1
print (count)

(c) 10

for record in reader:
	if "eng" in record['008'].value():
		print (record['008'])
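
Searching the whole 008 string for "eng" works for this data set, but it would also match if those three letters turned up anywhere else in the field. In the MARC 21 bibliographic format the language code sits at character positions 35-37 of the 008 field, so a stricter test can slice out just those characters. A minimal sketch follows; primary_language is a hypothetical helper name, and the sample value is typed in by hand in its display form, where each backslash stands for a blank.

```python
def primary_language(field_008_value):
    """Return the three-letter language code from an 008 value.

    Positions 35-37 (counting from 0) hold the language code in the
    MARC 21 bibliographic format.
    """
    return field_008_value[35:38]


# An 008 value typed in by hand for this sketch. The display form uses
# a backslash for each blank, so we swap them back to spaces first.
# With pymarc you would pass record['008'].value() instead.
sample_008 = r"731001s1967\\\\at\ac\\\\\\\\\00010beng\d".replace("\\", " ")

print(len(sample_008))
print(primary_language(sample_008))

# The stricter test for English-language records
if primary_language(sample_008) == "eng":
    print("English is the primary language")
```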

Key Points

  • We can search for specific information in a set of MARC records.

  • We understand how to explore fields, indicators and subfields in any MARC record.


Changing a record with PyMARC

Overview

Teaching: 30 min
Exercises: 60 min
Questions
Objectives
  • How to change some information in a record

  • How to delete some information from a record

  • How to add some information to a record

  • How to make a new record

Episode 5: Parsing with pymarc

Start a new file in your IDE episode_5.py

Set up the basic record reader as we have previously. This time we’re only going to process one record:

from pymarc import MARCReader
my_marc_file = "NLNZ_example_marc.marc"


with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)

Let’s set up a parser that will allow us to manipulate a single record. We already know we’re going to reuse this parser for the rest of this episode, so let’s make sure we start with something that’s well built.

As we know we will be changing a record in some way, we’ll probably want to copy the record to a new object and make our changes on that. Python has a particular trait around copying objects that we need to be aware of. If we use basic assignment via an equals sign (b = a), behind the scenes Python doesn’t copy anything; it just makes a new pointer to the original object. This means that b is not a copy of a, it IS a! Any changes to a also show up in b. We can check this by asking Python to tell us the internal identifier it uses to track the various objects:

a = ["Hello"]
# assigning b to be a
b = a
#printing b
print ("b =", b)
print (id(a))
print (id(b))
print ()

#changing a *only*!
a[0] = "World!"
#printing b
print ("b =", b)
print (id(a))
print (id(b))

b = ['Hello']
1983706174528
1983706174528

b = ['World!']
1983706174528
1983706174528

Your id numbers will be different to the ones shown here, because they are assigned by Python at run time.

To make sure we make a new record that we can change without affecting the original record, we can use Python’s deepcopy() function from the copy module:

from pymarc import MARCReader
from copy import deepcopy

my_marc_file = "NLNZ_example_marc.marc"

with open(my_marc_file, 'rb') as data:
	reader = MARCReader(data)
	for record in reader:
		my_record = deepcopy(record)

		print (id(record))
		print (id(my_record))

		quit()
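
Why deepcopy() rather than a plain copy? A record is a nested object: it holds a list of fields, and each field holds its own data. A shallow copy duplicates only the outer container and keeps sharing the inner objects. The sketch below shows the difference using plain Python lists and dictionaries as a stand-in for a record. The data structure is illustrative only and is not how pymarc stores records.

```python
from copy import copy, deepcopy

# A nested stand-in for a record: a dict holding a list of field dicts
original = {"fields": [{"tag": "100", "a": "Manning, Arthur,"}]}

shallow = copy(original)     # new outer dict, same inner objects
deep = deepcopy(original)    # new outer dict AND new inner objects

# Change the author through each copy in turn
deep["fields"][0]["a"] = "Changed via deep copy"
print(original["fields"][0]["a"])   # the original is untouched

shallow["fields"][0]["a"] = "Changed via shallow copy"
print(original["fields"][0]["a"])   # the original has changed too

# The inner list is shared by the shallow copy, but not by the deep copy
print(shallow["fields"] is original["fields"])
print(deep["fields"] is original["fields"])
```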

Let’s look at how we can change an existing piece of information in a record. Currently in our record we can see we have an author noted as Arthur Manning.

As an exercise, let’s say that Arthur informed us that he isn’t in fact the author; his twin sister, Arthuretta, is. We need to change this record to make sure it’s accurate!

    for record in reader:
        my_record = deepcopy(record)

        # we only need to update the 'a' subfield.
        # note the catalogers punctuation... we must include the commas. 
        my_record['100']['a'] = "Manning, Arthuretta,"

        #comparing the two
        print (record['100'])
        print (my_record['100'])

        quit()

=100  1\$aManning, Arthur,$d1918-
=100  1\$aManning, Arthuretta,$d1918-

Note: of course, the MARC 100 field is an authorised person - so we shouldn’t really do this unless there is an authority file for this new person!

Try for yourself

How would you change the birth date in the ‘d’ subfield to 1920?

Solution

my_record['100']['d'] = "1920-"

Removing information from a record

Let’s see how we can remove a field. As an exercise, let’s say we need to remove the 300 field:

for record in reader:
    my_record = deepcopy(record)

    # We use the get_fields() method to generate a list of 300 fields
    # As there is only one, we can just remove it.
    for my_field in my_record.get_fields('300'):
        my_record.remove_field(my_field)

    #comparing the two
    print (record['300'])
    print (my_record['300'])

    quit()

This seems pretty straightforward. This is a simple case - there is only one 300 field, so we don’t need to do anything else to make sure we’re doing what we intended. One of the things to be aware of when we’re processing in bulk is writing scripts that have unintended consequences…

Unintended consequences

What do you think would happen if we used a field that occurs more than once in the record, like 035?

Solution

They would all be removed.

One of the ways we try to mitigate unintended consequences is to build checks into our script that help ensure that we only process things that fit our criteria. In this case the criterion is very simple: we want to delete the 300. There is actually a ‘hidden’ criterion that we’re implicitly addressing.

Unintended consequences - implicit requirements

What is the implicit requirement ‘hidden’ in the task “remove the 300 from our record”? What would be the impact of this on our process?

Solution

There might be an assumption that there is only one 300 field. If we assumed there was only ever one 300 field, and didn’t check, we might remove more fields than we expected to.

We have a few strategies to help with this problem.

  1. Check the standard. It may be that there is only one 300 field allowed in the record. This doesn’t always help, as we may find records that don’t comply with the standard!
  2. Check the corpus. It might be sensible to check the dataset we are working with to test what we find in our records.
  3. Use some logic checks in our scripts to ensure we only remove the 300 from a record where there is only one found in the record.

Let’s look at what #3 looks like in a script.

with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        my_record = deepcopy(record)

        # We use the get_fields() method to generate a list of 300 fields
        my_fields = my_record.get_fields('300')
        
        # We test if this list of fields contains only one member
        if len(my_fields) == 1:
            print ("Only one 300 field found in record ID {}. Removing it.".format(record['001'].value()))
            for my_field in my_fields:
                my_record.remove_field(my_field)
        else:
            print ("More than one 300 field found in record ID {}. Doing nothing.".format(record['001'].value()))

        # comparing the two
        print ("Number of 300 fields in record:", len(record.get_fields('300')))
        print ("Number of 300 fields in my_record:", len(my_record.get_fields('300')))
        print()

        # testing the failing case

        # We use the get_fields() method to generate a list of 035 fields
        my_fields = my_record.get_fields('035')
        
        # We test if this list of fields contains only one member
        if len(my_fields) == 1:
            print ("Only one 035 field found in record ID {}. Removing it.".format(record['001'].value()))
            for my_field in my_fields:
                my_record.remove_field(my_field)
        else:
            print ("More than one 035 field found in record ID {}. Doing nothing.".format(record['001'].value()))

        # comparing the two
        print ("Number of 035 fields in record:", len(record.get_fields('035')))
        print ("Number of 035 fields in my_record:", len(my_record.get_fields('035')))

        quit()

Only one 300 field found in record ID 9962783502836. Removing it.
Number of 300 fields in record: 1
Number of 300 fields in my_record: 0

More than one 035 field found in record ID 9962783502836. Doing nothing.
Number of 035 fields in record: 7
Number of 035 fields in my_record: 7
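
Strategy 2, checking the corpus, can also be sketched in a few lines. The hypothetical helper below (occurrence_profile is our own name) counts how many records contain zero, one, two or more occurrences of a tag. It relies only on each record having a get_fields() method, so with pymarc it should be possible to pass it the reader directly. To keep the sketch runnable without a MARC file, it is demonstrated with a tiny stand-in class, StandInRecord, which is not part of pymarc, and with made-up records.

```python
from collections import Counter


def occurrence_profile(records, tag):
    """Map 'number of times tag occurs' -> 'number of records'."""
    return Counter(len(record.get_fields(tag)) for record in records)


class StandInRecord:
    """Hypothetical stand-in that mimics only get_fields()."""

    def __init__(self, tags):
        self.tags = tags

    def get_fields(self, tag):
        return [t for t in self.tags if t == tag]


# Three made-up records, described only by the tags they contain
records = [
    StandInRecord(["001", "035", "035", "300"]),
    StandInRecord(["001", "035", "300", "300"]),
    StandInRecord(["001", "245"]),
]

profile = occurrence_profile(records, "300")
for occurrences, record_count in sorted(profile.items()):
    print(record_count, "record(s) with", occurrences, "x 300")
```

If the profile shows any records with more than one occurrence, we know the "only one 300" assumption does not hold for our data.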

Let’s look at how we might choose the field we want to delete when there is more than one. Let’s delete the 035 field that contains the text “ilsdb”.

We can do that by testing for the presence of the string “ilsdb” in our various 035 fields.

We’re starting with these 035 fields. We can see that only one field has “ilsdb” in it, so it’s a safe test to use in this case:

=035  \\$z4260
=035  \\$a(nzNZBN)687856
=035  \\$9   67095940
=035  \\$a(Nz)3760235
=035  \\$a(NLNZils)6278
=035  \\$a(NLNZils)6278-ilsdb
=035  \\$a(OCoLC)957343

for record in reader:
    my_record = deepcopy(record)

    print (record)

    my_fields = my_record.get_fields('035')
    for my_field in my_fields:
        if "ilsdb" in my_field.value():
            my_record.remove_field(my_field)


    print (len(record.get_fields('035')))
    print (len(my_record.get_fields('035')))

    quit()

7
6

We end up with these:

=035  \\$z4260
=035  \\$a(nzNZBN)687856
=035  \\$9   67095940
=035  \\$a(Nz)3760235
=035  \\$a(NLNZils)6278
=035  \\$a(OCoLC)957343

This is only one approach of many to tackling this task. For any given task the solution might require checking field indicators, other fields, text in subfields etc.
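
One way to combine a text test like this with the earlier "only act when exactly one matches" check is to collect the matches first and look at how many there are before removing anything. A minimal sketch on plain strings follows. The 035 values are copied from the listing above, and select_single_match is a hypothetical helper, not a pymarc function.

```python
def select_single_match(values, test):
    """Return the one value that passes test, or None.

    None is returned when zero or several values match, so the
    caller never acts on an ambiguous selection.
    """
    matches = [value for value in values if test(value)]
    if len(matches) == 1:
        return matches[0]
    return None


values_035 = [
    "4260",
    "(nzNZBN)687856",
    "   67095940",
    "(Nz)3760235",
    "(NLNZils)6278",
    "(NLNZils)6278-ilsdb",
    "(OCoLC)957343",
]

# Exactly one value ends with "-ilsdb", so it is safe to select
print(select_single_match(values_035, lambda v: v.endswith("-ilsdb")))

# Two values contain "NLNZils", so nothing is selected
print(select_single_match(values_035, lambda v: "NLNZils" in v))
```

The same pattern could be applied to a list of pymarc fields by testing my_field.value() and passing the selected field to remove_field().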

We can use a very similar approach to removing subfields. Let’s remove the ‘d’ subfield from the 100 field:

with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        my_record = deepcopy(record)
        my_fields = my_record.get_fields('100')
        for my_field in my_fields:
            my_field.delete_subfield('d') 

        print (record['100'])
        print (my_record['100'])

        quit()

=100  1\$aManning, Arthur,$d1918-
=100  1\$aManning, Arthur,

Adding information to a record

Let’s look at how we can add a new field to a record. To do this, we need to make a pymarc field object and add it to the record. There are two different types of field in pymarc: control fields and non-control (data) fields.

We can use the pymarc documentation to see how we can make a field data object:

from pymarc import Field

help(Field)

Help on class Field in module pymarc.field:

class Field(builtins.object)
 |  Field(tag, indicators=None, subfields=None, data='')
 |  
 |  Field() pass in the field tag, indicators and subfields for the tag.
 |  
 |      field = Field(
 |          tag = '245',
 |          indicators = ['0','1'],
 |          subfields = [
 |              'a', 'The pragmatic programmer : ',
 |              'b', 'from journeyman to master /',
 |              'c', 'Andrew Hunt, David Thomas.',
 |          ])
 |  
 |  If you want to create a control field, don't pass in the indicators
 |  and use a data parameter rather than a subfields parameter:
 |  
 |      field = Field(tag='001', data='fol05731351')
 ...

OK, so it looks like we need to pass the Field() constructor the tag we want to use, the indicators, and the subfield data. Let’s have a go!

from pymarc import Field 

with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        my_record = deepcopy(record)
        ### making the new 245 field
        my_new_245_field = Field(

                            tag = '245', 

                            indicators = ['0','1'],

                            subfields = [
                                            'a', 'The pragmatic programmer : ',
                                            'b', 'from journeyman to master /',
                                            'c', 'Andrew Hunt, David Thomas.',
                                        ]
                            ) 
        ### adding the new field
        my_record.add_field(my_new_245_field)

        ### showing the difference
        for original_245 in record.get_fields('245'):
            print (original_245)
     
        print ("______")

        for my_record_245 in my_record.get_fields('245'):
            print (my_record_245)

        quit()

=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
______
=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
=245  01$aThe pragmatic programmer : $bfrom journeyman to master /$cAndrew Hunt, David Thomas.

Lets have a look at the whole new record and double check things.

print (my_record)
=LDR  00912cam a2200301 a 4500
=001  9962783502836
=003  Nz
=005  20161223124839.0
=008  731001s1967\\\\at\ac\\\\\\\\\00010beng\d
=035  \\$z4260
=035  \\$a(nzNZBN)687856
=035  \\$9   67095940
=035  \\$a(Nz)3760235
=035  \\$a(NLNZils)6278
=035  \\$a(NLNZils)6278-ilsdb
=035  \\$a(OCoLC)957343
=040  \\$dWN*
=042  \\$anznb
=050  0\$aPN5596.B3$bM3
=082  0\$a823.2$220
=100  1\$aManning, Arthur,$d1918-
=245  10$aLarger than life :$bthe story of Eric Baume /$cby Arthur Manning.
=260  \\$aSydney [N.S.W.], :$aWellington [N.Z.] :$bReed,$c1967.
=300  \\$a184 p., [8] p. of plates :$bill., ports. ;$c23 cm.
=500  \\$aEric Baume was a New Zealander.
=600  10$aBaume, Eric,$d1900-1967.
=650  \0$aJournalists$zAustralia$xBiography.
=650  \0$aAuthors, New Zealand$y20th century$xBiography.
=245  01$aThe pragmatic programmer : $bfrom journeyman to master /$cAndrew Hunt, David Thomas.

Notice where the new field is. The add_field() method has added it to the end of the record.

Order of fields in a MARC record

Does a MARC record need to be sorted into ‘proper’ numerical order?

Solution

Sometimes it will, sometimes it won’t. As a data object, the record doesn’t necessarily need to be sorted numerically. The MARC standard only stipulates that control fields have to come before data fields (see “Structure of a MARC 21 Record”). As human readers we expect the fields to be in numerical order, and it’s not unreasonable to assume that some downstream tool might expect the fields to be numerically sorted.

If we want to ensure our new field is added in the correct numerical sort position we use the add_ordered_field() method:

for record in reader:
    my_record = deepcopy(record)
    ### making the new 245 field
    my_new_245_field = Field(

                        tag = '245', 

                        indicators = ['0','1'],

                        subfields = [
                                        'a', 'The pragmatic programmer : ',
                                        'b', 'from journeyman to master /',
                                        'c', 'Andrew Hunt, David Thomas.',
                                    ]
                        ) 
    ### adding the new field
    my_record.add_ordered_field(my_new_245_field)

While we’re thinking about validation and what we expect our records to look like, it’s worth knowing that PyMARC doesn’t do much (anything…) by way of data validation. It won’t prevent you from making a MARC item that isn’t compliant with the MARC standards.
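
Because validation is left to us, we may want small checks of our own. As a rough sketch (not a full validator, and the helper names are our own), the functions below take the list of tags in a record, in order, and test the two ordering points discussed above: whether control fields (00X) come before data fields, and whether the tags are in ascending order. The tag lists are abbreviated and typed in by hand.

```python
def control_fields_first(tags):
    """True if no control field (tag 00X) follows a data field."""
    seen_data_field = False
    for tag in tags:
        if tag.startswith("00"):
            if seen_data_field:
                return False
        else:
            seen_data_field = True
    return True


def tags_in_order(tags):
    """True if the tags are in ascending order (repeats allowed)."""
    return all(a <= b for a, b in zip(tags, tags[1:]))


# Tag order after add_field() put the new 245 at the end of the record
appended = ["001", "003", "005", "008", "100", "245", "260", "650", "245"]
# Tag order after add_ordered_field()
ordered = ["001", "003", "005", "008", "100", "245", "245", "260", "650"]

print(control_fields_first(appended), tags_in_order(appended))
print(control_fields_first(ordered), tags_in_order(ordered))
```

For a pymarc record, the tag list could be built with [my_field.tag for my_field in my_record].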

Making a new record

Let’s do one last task, and make a new record.

from pymarc import Record

my_new_record = Record()

print (my_new_record)
=LDR            22        4500

We’ve made a new empty record. All it contains is the minimum LEADER data required by a MARC record.

Try for yourself

Make a record that contains the following information

Tag  Ind_1  Ind_2  Subfields and data
003                Nz
100  1             (a) Gattuso, Jay (d) d1978-
245  1      0      (a) Goats. Are they the best animals? (b) What about Cats!?
650         0      (a) Goats (b) Competitive Pet Keeping
650         0      (a) Cats (b) Competitive Pet Keeping

Solution

from pymarc import Record, Field
my_new_record = Record()
my_new_fields = []
my_new_fields.append(Field('003', data='Nz'))
my_new_fields.append(Field(tag='100', indicators=['1',' '], subfields=['a','Gattuso, Jay,', 'd', 'd1978-']))
my_new_fields.append(Field(tag='245', indicators=['1','0'], subfields=['a','Goats. Are they the best animals? :', 'b', 'What about Cats!? /' ]))
my_new_fields.append(Field(tag='650', indicators=[' ','0'], subfields=['a','Goats', 'b', 'Competitive Pet Keeping']))
my_new_fields.append(Field(tag='650', indicators=[' ','0'], subfields=['a','Cats', 'b', 'Competitive Pet Keeping']))

for my_new_field in my_new_fields:
   my_new_record.add_ordered_field(my_new_field)

print (my_new_record)
=LDR            22        4500
=003  Nz
=100  1\$aGattuso, Jay,$dd1978-
=245  10$aGoats. Are they the best animals? :$bWhat about Cats!? /
=650  \0$aGoats$bCompetitive Pet Keeping
=650  \0$aCats$bCompetitive Pet Keeping

Key Points

  • We can manipulate a MARC record with PyMARC

  • We can change the information in a record

  • We can add a field to a record

  • We can make a new record


Saving MARC Files

Overview

Teaching: 5 min
Exercises: 15 min
Questions
Objectives
  • How to save MARC records

Episode 6: Saving MARC Files with pymarc

We only have one task left: saving our MARC records so we can ingest, use or share them.

Start a new file in your IDE episode_6.py

PyMARC comes with some useful tools that make this a very quick process.

from pymarc import MARCReader

my_marc_file = "NLNZ_example_marc.marc"

### We'll add records to this list. It can have a membership of 1 
my_marc_records = []
with open(my_marc_file, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        ### we add our records to our list of records
        my_marc_records.append(record)
        ### and print the IDs so we can see what's happening
        print (record['001'])

# we create a new file
my_new_marc_filename = "my_new_marc_file.marc" 
with open(my_new_marc_filename , 'wb') as data:
    for my_record in my_marc_records:
        ### and write each record to it
        data.write(my_record.as_marc())

print ()
print ("___")
print ()

### we open the new marc file
with open(my_new_marc_filename, 'rb') as data:
    reader = MARCReader(data)
    for record in reader:
        ### and print the IDs so we can validate that we saved all the records
        print (record['001'])

=001  9962783502836
=001  99627723502836
=001  99627923502836
=001  99628033502836
=001  99628063502836
=001  99628093502836
=001  99628113502836
=001  99628123502836
=001  99628153502836
=001  99628163502836

___

=001  9962783502836
=001  99627723502836
=001  99627923502836
=001  99628033502836
=001  99628063502836
=001  99628093502836
=001  99628113502836
=001  99628123502836
=001  99628153502836
=001  99628163502836

Most of this script is helper code to show what’s going on! If we strip it down to the basics, and assume we have a list of records to save, it’s only a few lines:



my_new_marc_filename = "my_new_marc_file.marc" 
with open(my_new_marc_filename , 'wb') as data:
    for my_record in my_marc_records:
        data.write(my_record.as_marc())


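We’re not limited to MARC binary. PyMARC records have a few other output methods that we could use if we needed something else. Note that only as_marc() and as_marc21() return bytes that can be written straight to a file opened with 'wb'. as_json() returns text, so it needs a file opened in text mode ('w'), and as_dict() returns a Python dictionary that has to be converted (for example to JSON) before it can be written:

my_record.as_dict()      # a Python dictionary
my_record.as_json()      # a JSON string
my_record.as_marc()      # MARC binary (bytes)
my_record.as_marc21()    # an alias of as_marc()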
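
As a hedged sketch of saving to a text format, the code below writes a list of dictionaries to a JSON file with Python’s standard json module and reads it back to confirm nothing was lost. The dictionary is a hand-made stand-in. With pymarc the list would be built by calling as_dict() on each record, and the exact layout of that dictionary is not shown here.

```python
import json
import os
import tempfile

# Hand-made stand-in for the dictionaries produced from our records
my_record_dicts = [
    {"leader": "00912cam a2200301 a 4500",
     "fields": [{"001": "9962783502836"}, {"003": "Nz"}]},
]

# Write to a temporary folder so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    my_json_filename = os.path.join(folder, "my_new_marc_file.json")

    # Text mode ('w'), not binary ('wb'), because JSON is text
    with open(my_json_filename, "w", encoding="utf-8") as data:
        json.dump(my_record_dicts, data, indent=2)

    # Read it back to confirm that nothing was lost
    with open(my_json_filename, "r", encoding="utf-8") as data:
        reloaded = json.load(data)

print(len(reloaded), "record(s) saved and reloaded")
print(reloaded == my_record_dicts)
```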

Key Points

  • We can save a MARC record to a suitable format

  • We can save MARC records to a suitable format