This article explains what serialization is, when to use it, and when not to. It also covers compressing pickled objects, multiprocessing, and more with the Python pickle module.
Serialization and Deserialization
The Python pickle module is used to serialize and deserialize a Python object structure. Pickling an object allows it to be saved on disk: pickle first serializes the object and then writes it to a file. Pickling is the process of converting a list, dict, or other Python object into a byte stream. The idea is that this byte stream contains all of the data needed to recreate the object in another Python script, for instance when you want to save an object to a disk file or a cache and retrieve it later.
As a data scientist, you’ll work with dictionaries, DataFrames, and other data types to build data sets. You may want to save them to a file so that you can use them later or give them to someone else. The pickle module in Python does just that: it serializes objects to be saved to a file and loaded into a program later.
Pickling is most useful for data analysis when performing repetitive tasks on the data, such as pre-processing. It also makes a lot of sense when dealing with Python-specific data types like dictionaries.
For example, in NLTK workflows, pickling is used to save trained machine learning models. Thus, we don't have to retrain a model every time we want to use it, which takes a long time.
Benefits of using the pickle module:
- Pickle keeps track of the objects it has already serialized, so later references to the same object are not serialized again. The marshal module does not do this, and attempting to marshal a recursive object (one that contains a reference to itself) can crash the Python interpreter.
- Pickle handles object sharing, where the same object is referenced from different places in the structure being serialized. As with self-referencing objects, pickle stores the object once and ensures that all other references point to the master copy. Shared objects remain shared, which is crucial for mutable objects.
- Pickle can save and restore class instances transparently, while marshal doesn't support user-defined classes and their instances at all. The class definition must be importable and must live in the same module as it did when the object was stored.
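The points about shared and self-referencing objects can be checked with a short round trip. This is a minimal sketch; the variable names are made up for illustration.

```python
import pickle

# Two keys that refer to the same list object (a shared reference).
shared = [1, 2, 3]
data = {"a": shared, "b": shared}

# A list that contains itself (a recursive object).
recursive = []
recursive.append(recursive)

restored = pickle.loads(pickle.dumps(data))
restored_recursive = pickle.loads(pickle.dumps(recursive))

# The sharing survives the round trip: both keys still point at one list.
print(restored["a"] is restored["b"])

# Mutating through one reference is visible through the other.
restored["a"].append(4)
print(restored["b"])

# The recursive list still contains itself.
print(restored_recursive[0] is restored_recursive)
```

Because the list is stored once, the restored dictionary holds one list under two keys rather than two independent copies.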
The following topics will be included in this tutorial:
- What exactly is pickling?
- What would you do with a pickle?
- When do you not use pickles?
- What kinds of things can be pickled?
- Pickle Features
- JSON vs Pickle
- Files for pickling
- Unpickling files
- Pickle file compression
- Python 3: Unpickling Python 2 Items
- Multiprocessing and pickling
What exactly is pickling?
Pickle is a Python object serializer and deserializer. Serialization, also known as marshalling or flattening, transforms an in-memory object into a byte stream that can be saved on disk or sent over a network. This byte stream can later be read back and deserialized into a Python object. Pickling is not the same as compression. The former converts an object from one representation (data in random access memory, RAM) to another (bytes on disk), while the latter encodes data with fewer bits to conserve disk space.
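The difference between the two steps can be seen by applying them one after the other. The sketch below uses the standard zlib module for the compression step, purely as an illustration; the sample data is hypothetical.

```python
import pickle
import zlib

# Hypothetical sample data: a list with a lot of repetition.
records = ["pickle"] * 1000

serialized = pickle.dumps(records)      # object -> bytes (serialization)
compressed = zlib.compress(serialized)  # bytes -> fewer bytes (compression)

print(len(serialized), len(compressed))

# Reversing both steps, in the opposite order, gives back an equal object.
restored = pickle.loads(zlib.decompress(compressed))
print(restored == records)
```

Serialization changes the representation, and compression shrinks bytes that already exist. Neither step replaces the other.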
What would you do with a pickle?
Pickling is useful for applications that need some level of data persistence. The state data of your program can be saved to disk so that you can continue working on it later. Pickle can also be used to store Python objects in a database or to send data over a Transmission Control Protocol (TCP) or socket connection. It comes in handy when you work with machine learning models and need to store them so that you can make new predictions later without having to rewrite anything or train the model from scratch.
When you shouldn’t use Pickles
Pickle is not recommended if you need to use data across several programming languages. Since its protocol is Python-specific, cross-language compatibility cannot be assured. The same is true for different Python versions. Unpickling a file that was pickled in a different version of Python may not always work, so make sure you're on the same version and upgrade if possible. You should also avoid unpickling data from an untrusted source. Unpickling such a file may result in the execution of malicious code.
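The security warning is easier to appreciate with a harmless demonstration. In this sketch, the hypothetical Payload class stands in for attacker-controlled data. It uses the real `__reduce__` hook to make the unpickler call a function of its choosing, here the innocent built-in `len`.

```python
import pickle


class Payload:
    """Stand-in for attacker-controlled data (hypothetical class)."""

    def __reduce__(self):
        # pickle records "call len with this argument" in the byte stream.
        # A real attack would name a dangerous callable instead.
        return (len, ("four",))


stream = pickle.dumps(Payload())

# loads() does not rebuild a Payload; it calls the recorded callable
# and returns whatever that call produces.
result = pickle.loads(stream)
print(result, isinstance(result, Payload))
```

The byte stream decides which callable runs, which is why pickle data should only be loaded from sources you trust.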
What kinds of things can be pickled?
Objects with the following data types can be pickled: Booleans, integers, floats, complex numbers, strings (normal and Unicode), tuples, lists, sets, and dictionaries, as long as the containers hold only picklable objects.
All of the above can be pickled, but you can also pickle classes and functions defined at the top level of a module, for example.
Generators, inner classes, lambda functions, and defaultdicts are examples of things that can't be pickled easily. For lambda functions, you'll need to use the dill package. For defaultdicts, you must use a module-level function as the default factory when you build them.
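The defaultdict case can be sketched as follows. The factory function `zero` is a hypothetical name; any function defined at module level would do.

```python
import pickle
from collections import defaultdict


def zero():
    """Module-level default factory; pickle can store it by name."""
    return 0


counts_ok = defaultdict(zero)
counts_ok["apple"] += 1

counts_bad = defaultdict(lambda: 0)
counts_bad["apple"] += 1

# The defaultdict built with the module-level factory round-trips cleanly.
restored = pickle.loads(pickle.dumps(counts_ok))
print(restored["apple"], restored["missing"])

# The defaultdict built with a lambda factory cannot be pickled.
try:
    pickle.dumps(counts_bad)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print("Could not pickle:", type(exc).__name__)
```

The restored defaultdict keeps its default factory, so looking up a missing key still yields 0.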
Pickle Features
The pickle module has the following characteristics:
- It is primarily intended for use with Python scripts
- It is used to save Python objects and to pass them between processes
- It keeps track of all serialized objects, so objects that have already been serialized are not serialized again
- It can save and restore class instances transparently
- It is not safe to use with untrusted data. As a result, it is not a good idea to unpickle data from an unknown source
For serialization, use dump():
The dump() function converts the object data into a byte stream before saving it to a file. Three arguments are commonly passed to this function. The first two are required, while the third is optional. The first argument is the data object that needs to be serialized. The second argument is the file handler object of the file used to store the pickled data. The final argument is the protocol version, given as an integer.
dump(data_object, file_object, [protocol])
Deserialization: load()
To convert byte stream data from a file back into a Python object, use the load() function. The file handler object of the file is passed as the argument, and the data is read from that object.
load(file_object)
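The two signatures above can be exercised together in one short round trip. This is a minimal sketch: the settings dictionary and the file name settings.pkl are placeholders, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

settings = {"theme": "dark", "recent": ["a.txt", "b.txt"], "volume": 0.8}

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.pkl")

    # Third argument: the protocol version, given as an integer.
    with open(path, "wb") as file_object:
        pickle.dump(settings, file_object, pickle.HIGHEST_PROTOCOL)

    # load() detects the protocol from the data; none is passed here.
    with open(path, "rb") as file_object:
        loaded = pickle.load(file_object)

print(loaded == settings)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

Leaving out the third argument makes dump() use pickle.DEFAULT_PROTOCOL, which is usually a reasonable choice.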
Pickle a simple object and save it to a file
With the following Python script, create a file called sport.py. The script declares a data object called dataObject and stores five sport names in it by iterating a for loop. The open() function is used to assign a file handler to a binary file called sport. The dump() function is used to serialize the data in dataObject and save it to the sport file. If the serialization is performed correctly, the message "Data is serialized" will appear on the screen.
# sport.py
# Import the pickle module
import pickle

# Declare the object to store data
dataObject = []

# Iterate the for loop for 5 times and take sport names
for n in range(5):
    raw = input('Enter name for a sport :')
    dataObject.append(raw)

# Open a file for writing data
file_handler = open('sport', 'wb')
# Dump the data of the object into the file
pickle.dump(dataObject, file_handler)
# close the file handler to release the resources
file_handler.close()

# Print message
print('Data is serialized')
Unpickle data from a file
Unpickling data is the reverse of pickling it. With the following Python script, create a file called sport_deserialize.py. The open() function is used to open the binary file called sport that was generated in the previous example. The data from the file is unpickled and stored in the variable dataObject using the load() function. The data in dataObject is then iterated and printed in the terminal using a for loop.
# sport_deserialize.py
# Import the pickle module
import pickle

# Open a file handler for reading a file from where the data will load
file_handler = open('sport', 'rb')
# Load the data from the file after deserialization
dataObject = pickle.load(file_handler)
# Close the file handler
file_handler.close()

# Print message
print('Data after deserialization')
# Iterate the loop to print the data after deserialization
for val in dataObject:
    print('The data value : ', val)
Pickle a Class Object and save it as a file
The following example demonstrates how to pickle a class object. With the following script, create a file called employee.py. The script declares an Employee class that is used to assign three data values to an employee. After the class object has been initialized, a file handler object called fileHandler is created to open a file for writing. The data is then serialized and saved in the employee_info file using the dump() function. If the file is properly created, the message "Data is serialized" will appear.
# employee.py
# Import pickle module
import pickle

# Declare the employee class to store the value
class Employee:
    def __init__(self, name, email, post):
        self.name = name
        self.email = email
        self.post = post

# Create employee object
empObject = Employee('james', 'james01@gmail.com', 'CEO')

# Open file for store data
fileHandler = open('employee_info', 'wb')
# Save the data into the file
pickle.dump(empObject, fileHandler)
# Close the file
fileHandler.close()

# Print message
print('Data is serialized')
Unpickle data in a Class Object
To unpickle a class object, the script must declare (or import) the class with all of the required attributes and methods. Put the following code in a file called employee_deserialize.py. The Employee class is defined here so that the data can be retrieved. The fileObject variable holds the file handler that is used to open and read the employee_info file. The load() function deserializes the data and stores it in a class object called employee. The Employee class's display() method is used to print the data values of the class object.
# Import pickle module
import pickle

# Declare employee class to read and print data from a file
class Employee:
    def __init__(self, name, email, post):
        self.name = name
        self.email = email
        self.post = post

    def display(self):
        print('Employee Details:')
        print('Employee Name :', self.name)
        print('Employee Email :', self.email)
        print('Employee Position :', self.post)

# Open the file for reading
fileObject = open('employee_info', 'rb')
# Unpickle the data
employee = pickle.load(fileObject)
# Close file
fileObject.close()

# Print the employee data
employee.display()
JSON vs Pickle
JSON stands for JavaScript Object Notation. It's a lightweight data-interchange format that humans can read. JSON is standardized and language-independent, although it was derived from JavaScript, which gives it a significant benefit over pickle. It's also much safer to load than pickle.
If all you need is Python, the pickle module is still a good option because of its ease of use and ability to rebuild complete Python objects.
On the other hand, the JSON module can only handle a small number of Python types, while pickle can handle a much greater number of types, including custom classes.
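In Python 2, cPickle is an alternative. It's nearly identical to pickle, except that it's written in C, so it can be up to 1000 times faster. With small files, on the other hand, you will not notice the difference in speed. Pickle and cPickle both produce the same data streams, so they can use the same files. In Python 3, there is no separate cPickle module: pickle uses the C implementation automatically when it is available.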
Finally, though using the JSON module to deserialize JSON data is secure, you should never unpickle untrusted data because it could contain malicious pickled data that can execute arbitrary code when unpickled.
The key differences between the JSON and pickle modules are mentioned below.
- The pickle module is Python-specific, which means that once an object has been serialized, it cannot be deserialized using PHP, Java, Perl, or other languages. If you need interoperability, use the JSON module.
- The pickle module serializes data in binary format, unlike the JSON module, which serializes objects as human-readable JSON strings.
- Only the most basic Python types (like int, str, dict, and list) can be serialized using the JSON module. You'll have to write your own serialization function if you need to serialize custom objects. On the other hand, the pickle module works with a wide range of Python types right out of the box, including your custom classes.
- The pickle module is written in C for the most part. As a result, it can provide a performance boost over the JSON module when dealing with massive data sets, although the difference depends on the data and the Python version.
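The differences in the list above can be seen side by side in a few lines. This is a minimal sketch with made-up sample dictionaries: one holds only basic types, and the other contains a set, which the JSON module does not handle by default.

```python
import json
import pickle

basic = {"name": "Mango", "weight": 21, "tags": ["fruit", "tropical"]}
with_set = {"name": "Mango", "tags": {"fruit", "tropical"}}

# JSON output is readable text; pickle output is binary.
as_json = json.dumps(basic)
as_pickle = pickle.dumps(basic)
print(type(as_json).__name__, type(as_pickle).__name__)

# Both round-trip the basic dictionary.
print(json.loads(as_json) == basic, pickle.loads(as_pickle) == basic)

# A set is outside what the json module handles by default.
try:
    json.dumps(with_set)
    json_handles_set = True
except TypeError:
    json_handles_set = False

# pickle handles the same object without extra code.
pickle_handles_set = pickle.loads(pickle.dumps(with_set)) == with_set
print(json_handles_set, pickle_handles_set)
```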
Files for pickling: fruits
To use pickle, you must first import it into Python.
import pickle
You will be pickling a basic dictionary for this tutorial. A dictionary is a set of key: value pairs. You’ll save it as a file, then reload it. Declare the dictionary to be as follows:
fruits_dict = { 'Mango': 3, 'Apple': 8, 'Passion': 5, 'melon': 10, 'Quava': 12, 'Lemon': 9, 'Banana': 16 }
To pickle this dictionary, you must first state the name of the file to which it will be written, which in this case is fruits.
It’s worth noting that the file has no extension.
Simply use the open() function to open the file for writing. The first argument should be the name of your file. The second argument is 'wb'. The w stands for writing to the file, and the b stands for binary mode, which means that the data will be written as bytes objects. If you forget the b, you'll get an error such as TypeError: write() argument must be str, not bytes. You may sometimes encounter a slightly different notation, such as w+b, which also opens the file for reading; for writing a pickle, it performs the same role.
filename = 'fruits'
output_file = open(filename, 'wb')
Once the file has been opened for writing, you can use pickle.dump(), which takes two arguments: the object to pickle and the file to which the object must be saved. In this case, the former is fruits_dict and the latter is output_file.
Don’t forget to use close() to close the file!
pickle.dump(fruits_dict, output_file)
output_file.close()
Now, in the same directory as your Python script, a new file called fruits should have appeared unless you specified a file path as the file name. The complete code is as follows:
# fruits.py
# Import pickle module
import pickle

fruits_dict = { 'Mango': 3, 'Apple': 8, 'Passion': 5, 'melon': 10, 'Quava': 12, 'Lemon': 9, 'Banana': 16 }

filename = 'fruits'
output_file = open(filename, 'wb')
pickle.dump(fruits_dict, output_file)
output_file.close()

# Print message
print('Data is serialized')
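The point about binary mode can be verified by trying both modes. The sketch below writes to a temporary directory and uses `with` blocks, which close the file automatically. Both choices are for illustration only and are not part of the script above.

```python
import os
import pickle
import tempfile

fruits = {"Mango": 3, "Apple": 8}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits")

    # Text mode ('w'): pickle writes bytes, so the file object rejects them.
    try:
        with open(path, "w") as text_file:
            pickle.dump(fruits, text_file)
        text_mode_error = None
    except TypeError as exc:
        text_mode_error = str(exc)

    # Binary mode ('wb'): the write succeeds.
    with open(path, "wb") as binary_file:
        pickle.dump(fruits, binary_file)

    with open(path, "rb") as binary_file:
        reloaded = pickle.load(binary_file)

print(text_mode_error)
print(reloaded == fruits)
```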
Unpickling files: fruits
The procedure for loading a pickled file back into a Python program is similar to the one above: use the open() function again, but this time with 'rb' as the second argument instead of 'wb'. The r denotes read mode, while the b denotes binary mode. Assign the opened binary file to input_file. Then call pickle.load() with input_file as its argument and assign the result to new_dict. This new variable now holds the contents of the file. You'll need to close the file once more at the end.
input_file = open(filename, 'rb')
new_dict = pickle.load(input_file)
input_file.close()
To make sure that you unpickled it successfully, print the dictionary, compare it to the previous dictionary, and check its type.
print(new_dict)
print(new_dict == fruits_dict)
print(type(new_dict))
The complete code for unpickling the fruits is:
# fruits_deserialize.py
# Import pickle module
import pickle

fruits_dict = { 'Mango': 3, 'Apple': 8, 'Passion': 5, 'melon': 10, 'Quava': 12, 'Lemon': 9, 'Banana': 16 }

input_file = open('fruits', 'rb')
new_dict = pickle.load(input_file)
input_file.close()

print(new_dict)
print(new_dict == fruits_dict)
print(type(new_dict))
Pickle file compression
You may want to compress your pickled file if you're saving a big dataset and it's taking up a lot of space. This can be accomplished with either bzip2 or gzip. Both compress files, but bzip2 is generally slower. gzip, on the other hand, can generate files that are about twice as large as those produced by bzip2. In this tutorial, you'll be using bzip2.
Keep in mind that compression and serialization are not synonymous! If you need to refresh your memory, go back to the section "What exactly is pickling?".
Begin by importing the bz2 module with import bz2. Pickle is imported in the same manner as it was at the start of this article.
# file_compression.py
import bz2
import pickle

fruits_dict = { 'Mango': 3, 'Apple': 8, 'Passion': 5, 'melon': 10, 'Quava': 12, 'Lemon': 9, 'Banana': 16 }

# The with block closes the file, which flushes the compressed data to disk
with bz2.BZ2File('smallerfile', 'w') as small_file:
    pickle.dump(fruits_dict, small_file)

# Print message
print('Data is serialized and compressed')
A new file called smallerfile should have appeared. Keep in mind that the difference in file size between compressed and uncompressed versions will not be apparent with small object structures.
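With a larger object, the size difference becomes visible, and the compressed file can be read back in the same way it was written. The sketch below is illustrative only: the sample rows and the file name rows.pkl.bz2 are made up, and the exact sizes will vary with the data.

```python
import bz2
import gzip
import os
import pickle
import tempfile

# Hypothetical sample data, large enough for compression to matter.
rows = [{"id": i, "fruit": "Mango", "weight": i % 50} for i in range(5000)]

raw = pickle.dumps(rows)
sizes = {
    "raw": len(raw),
    "gzip": len(gzip.compress(raw)),
    "bz2": len(bz2.compress(raw)),
}
for name, size in sizes.items():
    print(name, size)

# Write a compressed pickle file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rows.pkl.bz2")
    with bz2.open(path, "wb") as compressed_file:
        pickle.dump(rows, compressed_file)
    with bz2.open(path, "rb") as compressed_file:
        restored = pickle.load(compressed_file)

print(restored == rows)
```

Swapping bz2.open for gzip.open gives the gzip variant with the same calls.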
Python 3: Unpickling Python 2 Items
When using Python 3, you may come across items that were pickled in Python 2. Unpickling these can be a pain.
You could either run Python 2 to unpickle them or use Python 3's load() function with encoding='latin1'.
input_file = open(filename, 'rb')
new_dict = pickle.load(input_file, encoding='latin1')
According to the Python documentation, encoding='latin1' is required when the objects contain NumPy arrays or datetime, date, and time instances pickled by Python 2. If you would rather read the Python 2 8-bit strings as bytes objects, you may use encoding='bytes' instead:
input_file = open(filename, 'rb')
new_dict = pickle.load(input_file, encoding='bytes')
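The effect of the encoding argument can be seen on a tiny Python 2 pickle. The byte string below is written by hand for illustration and is not taken from a real file. It is a protocol 2 pickle of the Python 2 8-bit string 'caf\xe9'.

```python
import pickle

# Hand-written Python 2 pickle (protocol 2) of the 8-bit str 'caf\xe9'.
# Layout: PROTO 2, SHORT_BINSTRING of length 4, BINPUT 0, STOP.
py2_pickle = b"\x80\x02U\x04caf\xe9q\x00."

# The default encoding is ASCII, which cannot decode the byte 0xE9.
try:
    pickle.loads(py2_pickle)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

as_text = pickle.loads(py2_pickle, encoding="latin1")  # decoded to str
as_bytes = pickle.loads(py2_pickle, encoding="bytes")  # kept as bytes

print(default_ok, repr(as_text), repr(as_bytes))
```

With 'latin1', every byte maps to a character, so decoding never fails. With 'bytes', the data is left untouched for you to decode yourself.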
Multiprocessing and pickling
When doing extremely complex computations, it’s normal to split the job into many processes. Multiprocessing refers to the execution of multiple processes simultaneously, typically through multiple Central Processing Units (CPUs) or CPU cores, thus saving time. Training machine learning models or neural networks, for example, is a time-consuming and intensive process. A lot of time can be saved by spreading these across a large number of processing units.
The multiprocessing package in Python is used to accomplish this.
When a job is divided across multiple processes, data may need to be shared between them. Since processes do not share memory, they must use serialization to transfer data to one another, which is accomplished using the pickle module.
Begin by importing multiprocessing as mp and cos from math in the following example. Then, build a Pool abstraction that allows you to define the number of worker processes to be used. The Pool handles the multiprocessing in the background. The cos function can then be mapped over a range of 20 with the Pool.
import multiprocessing as mp
from math import cos

p = mp.Pool(4)
p.map(cos, range(20))
As you can see, the cos function is applied without any problems. However, this will not always be the case. It's important to keep in mind that lambda functions can't be pickled. As a result, attempting to apply multiprocessing to a lambda function will fail.
p.map(lambda x: 2**x, range(20))
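Why the first call works and the second fails can be checked without starting any processes, simply by pickling the functions directly. The sketch also shows one standard-library alternative to the lambda, functools.partial. This is an illustration added here and is not a technique from the original example.

```python
import pickle
from functools import partial
from math import cos

# Named functions are pickled by reference (module and name), so cos works.
cos_again = pickle.loads(pickle.dumps(cos))
print(cos_again is cos)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: 2**x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False
print(lambda_ok)

# Alternative: build the same function from picklable parts.
# partial(pow, 2) calls pow(2, x), which equals 2**x.
power_of_two = partial(pow, 2)
restored = pickle.loads(pickle.dumps(power_of_two))
print([restored(x) for x in range(5)])
```

A function defined with def at module level would work just as well as the partial object.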
There is a way out of this. dill is a pickle-like package that, among other things, can serialize lambda functions. It’s used in a similar way to pickle.
import dill

# The with block closes the file once the lambda has been written
with open('dill_file', 'wb') as dill_file:
    dill.dump(lambda x: x**4, dill_file)
If you want to use multiprocessing with a lambda function or other data types that pickle doesn't support, you'll need to use pathos.multiprocessing, a fork of multiprocessing.
pathos.multiprocessing uses dill instead of pickle for serialization. Creating a Pool and mapping a lambda function with it follows the same steps as before.
import pathos.multiprocessing as mp

p = mp.Pool(4)
p.map(lambda x: 4**x, range(20))
There is no PicklingError this time!
Example 1: pickling program: store and read
# pickling_program.py
"""
- program illustrating how to store data efficiently using the pickle module
- The module translates an in-memory Python object into a serialized byte stream, a string
- of bytes that can be written to a file-like object
"""
import pickle


def writeData():
    # initializing data to be stored in db
    Mango = {'key': 'Mango', 'name': 'Mangifera indica', 'weight': 21, 'pay': 100}
    Apple = {'key': 'Apple', 'name': 'Malus domestica', 'weight': 50, 'pay': 250}

    # database
    db = {}
    db['Mango'] = Mango
    db['Apple'] = Apple

    # using binary mode is vital; 'wb' overwrites the file so that
    # repeated runs do not append extra copies of the data
    db_file = open('fruits_pickle', 'wb')
    # source, destination
    pickle.dump(db, db_file)
    db_file.close()


def readData():
    # binary mode is critical when reading
    db_file = open('fruits_pickle', 'rb')
    db = pickle.load(db_file)
    for keys in db:
        print(keys, '=>', db[keys])
    db_file.close()


# driver code for the pickling program
if __name__ == '__main__':
    print("###### initiate storing the data ###### : ")
    writeData()
    print("\n\n\n ###### start loading the data ###### : ")
    readData()
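When a pickle file is opened in append mode ('ab'), each dump() call adds another pickled object to the end of the file, and a single load() call reads only the first one. The sketch below shows one way to read every object back. The helper name `load_all` and the sample dictionaries are hypothetical.

```python
import os
import pickle
import tempfile


def load_all(path):
    """Return every object stored in a pickle file, in the order written."""
    objects = []
    with open(path, "rb") as handle:
        while True:
            try:
                objects.append(pickle.load(handle))
            except EOFError:
                # Raised when no further pickled object is left in the file.
                break
    return objects


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits_pickle")

    # Two separate runs of an appending writer, simulated with two dumps.
    for db in ({"Mango": 21}, {"Mango": 21, "Apple": 50}):
        with open(path, "ab") as handle:
            pickle.dump(db, handle)

    with open(path, "rb") as handle:
        first_only = pickle.load(handle)  # a single load() stops after one object

    everything = load_all(path)

print(first_only)
print(everything)
```

If only the latest state matters, opening the file with 'wb' avoids the problem by overwriting the previous contents.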
Example 2: pickling without a file
# pickling_without_file.py
# Import pickle module
import pickle

""" Pickling without a file """

# data to be stored in the db is initialized
Mango = {'key' : 'Mango', 'name' : 'Mangifera indica', 'weight' : 21, 'pay' : 100}
Apple = {'key' : 'Apple', 'name' : 'Malus domestica', 'weight' : 50, 'pay' : 250}

# database
db = {}
db['Mango'] = Mango
db['Apple'] = Apple

# driver code for the pickling program
if __name__ == '__main__':
    # storing
    b = pickle.dumps(db)  # type(b) gives <class 'bytes'>
    # loading
    my_entry = pickle.loads(b)
    print(my_entry)
Final Thoughts
Python's pickle module is useful for data serialization and deserialization. After completing the examples in this guide, copying data from one Python script to another should be much simpler. Congratulations on your achievement! You're now ready to use Python to pickle and unpickle files. You'll be able to save your machine learning models and return to them later to continue working on them. If you're interested in learning more about building predictive models in Python, check out supervised learning with scikit-learn, which covers that more advanced material in greater depth.