Updated On : Nov-20,2021 Tags awkward-array, json-like…

Awkward Array: Guide to Work with JSON-like Nested, Variable-sized Datasets using Numpy-like Idioms

Datasets available for data analysis tasks are not always clean or stored in a structured format like database tables. Tree-like datasets such as JSON are one example. They are nested, can have more than one level, and different levels can hold different sets of elements. JSON accepts arrays (lists) as well as dictionaries. Python has fast, optimized libraries like numpy and pandas for working with arrays and structured datasets, but numpy arrays are fixed-length rectangular tables that cannot represent variable-length arrays. We cannot have a numeric numpy array where the first row has 2 elements, the second has 4, and so on.

The most common way to represent a tree-like data structure is JSON, but operations on JSON are not fast and optimized like numpy array operations. The awkward array library was designed to solve this problem: it lets us work with tree-like data structures at the speed of numpy-like operations. The library was originally designed for high-energy particle physics, which has complicated tree-like data structures that cannot be flattened into numpy arrays. Particle physics datasets are huge, and the people working with them needed a library that handles such datasets the way numpy handles arrays. Awkward array lets us work with tree-like nested data structures as if we were working with numpy arrays, and code written with it is fast like numpy code working on arrays.

As a part of this tutorial, we'll explain how to create and work with awkward arrays with simple examples. We'll try to cover as much API as possible of the library. Below we have highlighted important sections of the tutorial to give an overview of the material that we have covered.

Important Sections of Tutorial

  1. Awkward Array Creation
  2. Awkward Array Indexing/Slicing
  3. Awkward Array Attributes
  4. Normal Array Operations
  5. Simple Statistics on Awkward Arrays

Installation

We can install awkward array simply using the below pip command.

  • pip install awkward

We have imported an awkward array library and printed the version that we'll be using in our tutorial.

In [1]:
import awkward as ak

print("Awkward Array Version : {}".format(ak.__version__))
Awkward Array Version : 1.5.1

1. Awkward Array Creation

In this section, we'll explain how we can create Awkward arrays. There are various ways to create awkward arrays. We'll try to explain the majority of them. The arrays that we create in this section will be used in upcoming sections where we explain indexing and other features.

Below we have created JSON-like data using plain Python constructs. It is a list containing 4 inner lists. Three of the inner lists hold one or more dictionaries and one is empty. The dictionaries can have keys x and y. The value of key x is a list of variable length. The value of key y is another dictionary with key z, whose value is also a list of variable length. We can notice that the data does not strictly follow any structure: the inner lists have different numbers of elements, and the keys are present in some dictionaries and missing in others. If we have a very large dataset with such a structure and we have to loop through it to compute some statistics, it will take a lot of time. Thanks to Awkward array, we can work on it like a numpy array, with fewer lines of code and faster performance. We'll be converting the below data structure to an Awkward array next.

In [2]:
arr = [
    [ {"x": [1,2,3], "y": {"z": [4,5,6] }},
      {"x": [7,8,9,10], "y": {"z": [11,12] }}],
    [ {"x": [13,14,15,16,17,18]},
      {"y": {"z": [19,] }}],
    [],
    [ {"x": [20,21,22,23], "y": {"z": [24,25,26,27,28] }},
      {"x": [29,30,31,32,33,34,35,36],},
      {"y": {"z": [37,38] }}
    ]

]

Array() Constructor

The most common way to create an Awkward array is by using the Array() constructor available from awkward. This constructor accepts many kinds of input, such as numpy arrays, python lists, python dictionaries, iterators, and strings, and creates an Awkward array from it.

Below we have created an awkward array using Array() constructor by giving it the data structure we had created in the previous cell. We have also printed the array for display purposes.

In [3]:
ak_arr = ak.Array(arr)

ak_arr
Out[3]:
<Array [[{x: [1, 2, 3], ... z: [37, 38]}}]] type='4 * var * {"x": option[var * i...'>

The awkward array has an important attribute named type which returns type information about the underlying array. We can notice from the type information below that the array has 4 elements, each of which is a variable-length list. Each entry of those lists is a dictionary with keys x and y. The value of key x is a variable-length list and the value of key y is a dictionary with key z. The option[...] and ? markers mean that the value can be missing. The type also specifies the type of the innermost elements, which is int64.

Please make a NOTE that awkward array also lets us mix different data types, which is another plus point of using it.

In [4]:
ak_arr.type
Out[4]:
4 * var * {"x": option[var * int64], "y": ?{"z": var * int64}}

from_iter()

We can also create an awkward array using the from_iter() method available from the library. It accepts a python iterable as input. Below we have given an iterator over our data structure from earlier as input, and the method creates an awkward array from it.

In [5]:
ak_arr2 = ak.from_iter(iter(arr))

ak_arr2
Out[5]:
<Array [[{x: [1, 2, 3], ... z: [37, 38]}}]] type='4 * var * {"x": option[var * i...'>
In [6]:
ak_arr2.type
Out[6]:
4 * var * {"x": option[var * int64], "y": ?{"z": var * int64}}

ones_like()

The awkward array library provides methods like NumPy's to create arrays. It provides a method named ones_like() which takes another awkward array as input and returns a new awkward array where all elements are replaced with 1.

Below we have given our awkward array from previous cells as input to the method, and we can notice that all integer elements are replaced with 1.

In [7]:
ak.ones_like(ak_arr)
Out[7]:
<Array [[{x: [1, 1, 1], y: {, ... z: [1, 1]}}]] type='4 * var * {"x": option[var...'>

zeros_like()

The zeros_like() method works exactly like ones_like(), with the only difference being that it replaces all elements with zeros.

In [8]:
ak.zeros_like(ak_arr)
Out[8]:
<Array [[{x: [0, 0, 0], y: {, ... z: [0, 0]}}]] type='4 * var * {"x": option[var...'>

full_like()

The full_like() method works like ones_like() and zeros_like() but replaces input array elements with the element specified as second argument of the method call.

Below we have replaced all array values with 100.

In [9]:
ak.full_like(ak_arr, 100)
Out[9]:
<Array [[{x: [100, 100, 100], ... 100, 100]}}]] type='4 * var * {"x": option[var...'>

from_json()

The from_json() method takes a JSON file name or JSON-formatted contents as a string. It converts the input to an awkward array and returns it.

Below we have loaded GeoJSON file which has information about US states. We have downloaded the file and kept it in a datasets folder. The GeoJSON dataset is a good example where individual elements do not follow any structure. The Polygons and Multi Polygons representing different states can have a different number of elements. The US states GeoJSON dataset can be easily downloaded from the internet with a simple google search.

We can notice from the printed output that it is of type Record. The reason is that the file holds a single dictionary, and the actual data is kept under the features key of that dictionary. When we access the features key in the next cell, we can notice that it prints an awkward array.

In [10]:
us_states = ak.from_json("datasets/us-states.json")

us_states
Out[10]:
<Record ... [-111, 45], [-109, 45]]]}}]} type='{"type": string, "features": var ...'>
In [11]:
us_states["features"]
Out[11]:
<Array [{type: 'Feature', id: 'AL', ... ] type='50 * {"type": string, "id": stri...'>

Below we have loaded the contents of the GeoJSON file first as a string and then created an awkward array from it using from_json() method. The results will be the same as if we have given the input file name.

In [12]:
file_contents = open("datasets/us-states.json").read()

us_states = ak.from_json(file_contents)

us_states
Out[12]:
<Record ... [-111, 45], [-109, 45]]]}}]} type='{"type": string, "features": var ...'>
In [13]:
us_states["features"]
Out[13]:
<Array [{type: 'Feature', id: 'AL', ... ] type='50 * {"type": string, "id": stri...'>

from_numpy()

The from_numpy() method lets us create awkward array from the numpy array.

Below we have created a simple awkward array from a random data numpy array.

In [14]:
import numpy as np

ak.from_numpy(np.random.rand(10,10))
Out[14]:
<Array [[0.732, 0.671, ... 0.226, 0.932]] type='10 * 10 * float64'>

Awkward Array from GeoJSON Data for Future Examples

In this section, we have created another awkward array using GeoJSON dataset that we'll be using in upcoming sections of our tutorial for explanation purposes.

Below we have first loaded the US states GeoJSON dataset as a python dictionary. We have then read information about the US states population from a CSV file (US States Population 2018). We have populated the GeoJSON data with state population using a simple loop.

We have then printed the first and last few characters of our final GeoJSON dataset for displaying the contents of it.

In [15]:
import numpy as np
import pandas as pd
import json


us_states = json.load(open("datasets/us-states.json"))

print("Data Keys : {}\n".format(us_states.keys()))

us_states = us_states["features"]

df = pd.read_csv("datasets/State Populations.csv")

state_to_population = dict(zip(df["State"], df["2018 Population"]))

for feature in us_states:
    state_name = feature["properties"]["name"]
    feature["properties"]["population"] = state_to_population.get(state_name, None)

print("Data Overview (Start): ")
print(json.dumps(us_states, indent=4)[:500])
print("\nData Overview (End): ")
print(json.dumps(us_states, indent=4)[-500:])
Data Keys : dict_keys(['type', 'features'])

Data Overview (Start):
[
    {
        "type": "Feature",
        "id": "AL",
        "properties": {
            "name": "Alabama",
            "population": 4888949
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [
                        -87.359296,
                        35.00118
                    ],
                    [
                        -85.606675,
                        34.984749
                    ],


Data Overview (End):
         [
                        -111.047063,
                        42.000709
                    ],
                    [
                        -111.047063,
                        44.476286
                    ],
                    [
                        -111.05254,
                        45.002073
                    ],
                    [
                        -109.080842,
                        45.002073
                    ]
                ]
            ]
        }
    }
]

GeoJSON datasets include information about the geometry they represent. The geometries can be of different types like Polygon, MultiPolygon, LineString, etc. In our case, as we have loaded US states GeoJSON data, the state boundaries are represented using either Polygon or MultiPolygon.

Below we have retrieved the first element of our dataset and then printed information about it. The first element is Polygon. We have printed information about coordinates of Polygon as well as property name which holds state name.

In the next cell below, we have retrieved the second element of our dataset and printed information about it which is MultiPolygon.

Please notice the difference between the shape of Polygon and MultiPolygon.

In [16]:
print("Shape Type : {}".format(us_states[0]["geometry"]["type"]))

polygon = us_states[0]["geometry"]["coordinates"]

print("State Name : {}\n".format(us_states[0]["properties"]["name"]))

print("Polygon Shape : {}".format(np.array(polygon).shape))
print("Polygon Data : ")
print(polygon)
Shape Type : Polygon
State Name : Alabama

Polygon Shape : (1, 33, 2)
Polygon Data :
[[[-87.359296, 35.00118], [-85.606675, 34.984749], [-85.431413, 34.124869], [-85.184951, 32.859696], [-85.069935, 32.580372], [-84.960397, 32.421541], [-85.004212, 32.322956], [-84.889196, 32.262709], [-85.058981, 32.13674], [-85.053504, 32.01077], [-85.141136, 31.840985], [-85.042551, 31.539753], [-85.113751, 31.27686], [-85.004212, 31.003013], [-85.497137, 30.997536], [-87.600282, 30.997536], [-87.633143, 30.86609], [-87.408589, 30.674397], [-87.446927, 30.510088], [-87.37025, 30.427934], [-87.518128, 30.280057], [-87.655051, 30.247195], [-87.90699, 30.411504], [-87.934375, 30.657966], [-88.011052, 30.685351], [-88.10416, 30.499135], [-88.137022, 30.318396], [-88.394438, 30.367688], [-88.471115, 31.895754], [-88.241084, 33.796253], [-88.098683, 34.891641], [-88.202745, 34.995703], [-87.359296, 35.00118]]]
In [17]:
print("Shape Type : {}".format(us_states[1]["geometry"]["type"]))

multi_polygon = us_states[1]["geometry"]["coordinates"]

print("State Name : {}\n".format(us_states[1]["properties"]["name"]))

print("Number of Polygons : {}\n".format(len(multi_polygon)))

print("Polygon 0 Shape : {}".format(np.array(multi_polygon[0]).shape))
print("Polygon 0 Data : {}".format(multi_polygon[0]))
print("\nPolygon 1 Shape  : {}".format(np.array(multi_polygon[1]).shape))
print("Polygon 1 Data : {}".format(multi_polygon[1]))
print("\nLast Polygon Shape : {}".format(np.array(multi_polygon[-1]).shape))
print("Last Polygon Data : {}".format(multi_polygon[-1]))
Shape Type : MultiPolygon
State Name : Alaska

Number of Polygons : 39

Polygon 0 Shape : (1, 6, 2)
Polygon 0 Data : [[[-131.602021, 55.117982], [-131.569159, 55.28229], [-131.355558, 55.183705], [-131.38842, 55.01392], [-131.645836, 55.035827], [-131.602021, 55.117982]]]

Polygon 1 Shape  : (1, 5, 2)
Polygon 1 Data : [[[-131.832052, 55.42469], [-131.645836, 55.304197], [-131.749898, 55.128935], [-131.832052, 55.189182], [-131.832052, 55.42469]]]

Last Polygon Shape : (1, 7, 2)
Last Polygon Data : [[[173.107557, 52.992929], [173.293773, 52.927205], [173.304726, 52.823143], [172.90491, 52.762897], [172.642017, 52.927205], [172.642017, 53.003883], [173.107557, 52.992929]]]

Finally, we have created an awkward array from the GeoJSON data that we loaded and inspected in the previous cells. We'll be using this awkward array in our upcoming sections to explain a few examples.

In [18]:
ak_us_states = ak.Array(us_states)

ak_us_states
Out[18]:
<Array [{type: 'Feature', id: 'AL', ... ] type='50 * {"type": string, "id": stri...'>

Below we have printed type information of our awkward array created from GeoJSON dataset.

  • We can notice that it shows data has 50 entries each of type dictionary.
  • The dictionary has keys named type, id, properties, and geometry.
  • The value stored under key properties is another dictionary with keys name and population.
  • The value stored under key geometry is another dictionary with keys type and coordinates.
  • The coordinates key has variable-length data.

Please feel free to look above where we printed the first and last few characters of the dataset to match it with the data type printed below.

In [19]:
ak_us_states.type
Out[19]:
50 * {"type": string, "id": string, "properties": {"name": string, "population": int64}, "geometry": {"type": string, "coordinates": var * var * var * union[float64, var * float64]}}
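The union[float64, var * float64] part of the type appears because Polygon coordinates are nested 3 levels deep while MultiPolygon coordinates are nested 4 levels deep. The plain-Python sketch below measures that depth on two shortened, hypothetical geometries (the helper `nesting_depth` is made up for illustration and is not part of awkward).

```python
def nesting_depth(obj):
    """Return how many list levels wrap the innermost numbers."""
    depth = 0
    while isinstance(obj, list):
        depth += 1
        obj = obj[0]  # assumes every list on the path is non-empty
    return depth


# Hypothetical, shortened geometries laid out like GeoJSON coordinates.
polygon = [[[-87.4, 35.0], [-85.6, 35.0], [-87.4, 35.0]]]
multi_polygon = [
    [[[-131.6, 55.1], [-131.5, 55.3], [-131.6, 55.1]]],
    [[[-131.8, 55.4], [-131.6, 55.3], [-131.8, 55.4]]],
]

polygon_depth = nesting_depth(polygon)
multi_polygon_depth = nesting_depth(multi_polygon)

print(polygon_depth, multi_polygon_depth)
```

After three list levels, a Polygon has reached plain floats while a MultiPolygon still has one more list to go, which matches the two alternatives listed in the union.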

2. Awkward Array Indexing/Slicing

In this section, we'll explain with examples how we can perform indexing through our awkward array just like we do with numpy arrays. We'll be using two major arrays we created earlier for example purposes.

Example 1

In this example, we'll be performing indexing on our first array which we had created at the beginning of the tutorial.

Below we have accessed the 0th element of our awkward array using simple integer indexing.

In [20]:
ak_arr[0]
Out[20]:
<Array [{x: [1, 2, 3], y: {, ... z: [11, 12]}}] type='2 * {"x": option[var * int...'>

Whenever we index an awkward array and the elements at the next level are dictionaries, we can give a dictionary key inside the square brackets to access the contents stored under that key.

Below we have first accessed the 0th element of the awkward array which will bring an array that will have 2 elements inside of it. Both elements are dictionaries with keys x and y. We have then provided the second dimension as x to index the dictionary. This will bring x values of both dictionaries.

In [21]:
ak_arr[0, "x"]
Out[21]:
<Array [[1, 2, 3], [7, 8, 9, 10]] type='2 * option[var * int64]'>

Awkward array lets us access the fields of the dictionaries in the array by indexing the array directly with a key. The dictionaries need not be at the first level of the array; they can be 2-3 levels down as well.

Below we have accessed all x values of our array by simply indexing the awkward array with key x. This returns the value stored under key x for all elements, and it keeps the level structure. The dictionaries with key x were inside the elements of the main array, hence we can notice in the output that there are 2 brackets before the elements of x.

In [22]:
ak_arr["x"]
Out[22]:
<Array [[[1, 2, 3], ... 34, 35, 36], None]] type='4 * var * option[var * int64]'>

In the next cell, we have explained how we can access all x values with numpy like array indexing. This will give the same result as our previous cell.

In [23]:
x = ak_arr[:, :, "x"]

print(x)
[[[1, 2, 3], [7, 8, 9, 10]], ... 23], [29, 30, 31, 32, 33, 34, 35, 36], None]]

In the below cell, we have explained how we can get the 0th element of the lists that are present in all x keys. We have treated the first two dimensions as a numpy array and given : to select all elements. All elements will be dictionaries so we have given string 'x' to select all x values which are lists and then we gave index 0 to select 0th element from all x lists.

In [24]:
x = ak_arr[:, :, "x", 0]

print(x)
[[1, 7], [13, None], [], [20, 29, None]]
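For comparison, the same selection can be written in plain Python without awkward. The sketch below rebuilds the data from the start of the tutorial and uses a nested list comprehension (the name `first_x` is illustrative) that returns None wherever the 'x' key is missing.

```python
# The same nested data that was used to build ak_arr earlier.
arr = [
    [{"x": [1, 2, 3], "y": {"z": [4, 5, 6]}},
     {"x": [7, 8, 9, 10], "y": {"z": [11, 12]}}],
    [{"x": [13, 14, 15, 16, 17, 18]},
     {"y": {"z": [19]}}],
    [],
    [{"x": [20, 21, 22, 23], "y": {"z": [24, 25, 26, 27, 28]}},
     {"x": [29, 30, 31, 32, 33, 34, 35, 36]},
     {"y": {"z": [37, 38]}}],
]

# Outer loop: the 4 inner lists. Inner loop: the dictionaries in each list.
first_x = [
    [record["x"][0] if "x" in record else None for record in inner]
    for inner in arr
]

print(first_x)
```

The single indexing expression on the awkward array replaces both loops and the missing-key check.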

In the next cell, we have explained how we can get the 0th element from all lists stored inside the dictionaries (both the 'x' lists and the 'z' lists) without naming any key.

In [25]:
x = ak_arr[:, :, 0]

print(x)
[[{x: 1, y: {z: 4}}, {x: 7, y: {z: 11}}, ... x: 29, y: None}, {x: None, y: {z: 37}}]]

In the below cell, we have tried to retrieve the first two elements of each list present in the main array with the key x.

In [26]:
x = ak_arr[:, :, "x", 0:2]

print(x)
[[[1, 2], [7, 8]], [[13, 14], None], [], [[20, 21], [29, 30], None]]

In the next cell, we have first taken the 0th element from the list which will be a list of two dictionaries, then we have selected the first dictionary from it and retrieved 'y' key from it which is again a dictionary.

In [27]:
x = ak_arr[0, 0, "y"]

print(x)
{z: [4, 5, 6]}

In the next cell, we have taken the 0th element of our array which is a list of two dictionaries, then we have selected both dictionaries using : operator, and at last, we have retrieved element 'y' from both dictionaries.

In [28]:
x = ak_arr[0, :, "y"]

print(x)
[{z: [4, 5, 6]}, {z: [11, 12]}]

In the next cell, we have taken the 0th element of the array and retrieved the 'z' values stored inside the 'y' key of both of its dictionaries.

In [29]:
x = ak_arr[0, :, "y", "z"]

print(x)
[[4, 5, 6], [11, 12]]

In the next cell, we have first retrieved all 'z' key values using the same indexing as our previous cell, and then we have retrieved the element at index 1 (the second element) from each list of 'z' values.

In [30]:
x = ak_arr[0, :, "y", "z", 1]

print(x)
[5, 12]

In the next cell, we have selected all elements from the first dimension which will be all arrays inside of our main array, followed by all elements of those arrays which will be all dictionaries. We have then retrieved 'y' key value for all dictionaries, followed by all 'z' key values for all those 'y' values because the values of 'y' key is again a dictionary. Then we have retrieved the 0th value from all 'z' key values.

In [31]:
x = ak_arr[:, :, "y", "z", 0]

print(x)
[[4, 11], [None, 19], [], [24, None, 37]]

Our next cell retrieved the last element of all 'z' key arrays. It uses almost the same indexing as our previous cell with only one difference at last.

In [32]:
x = ak_arr[:, :, "y", "z", -1]

print(x)
[[6, 12], [None, 19], [], [28, None, 38]]

In the next cell, we have explained how we can access the fields of the dictionaries in our awkward array by treating the keys as attributes of the array object. We have retrieved all 'x' values. This gives the same result as the code ak_arr['x'].

In [33]:
ak_arr.x
Out[33]:
<Array [[[1, 2, 3], ... 34, 35, 36], None]] type='4 * var * option[var * int64]'>

In the next cell, we have retrieved all 'x' values from 1st element of the array.

In [34]:
ak_arr.x[0]
Out[34]:
<Array [[1, 2, 3], [7, 8, 9, 10]] type='2 * option[var * int64]'>

We can also retrieve 'y' key values because 'y' key is in the same dictionary and at the same level as 'x' key.

In [35]:
ak_arr.y
Out[35]:
<Array [[{z: [4, 5, 6]}, ... {z: [37, 38]}]] type='4 * var * ?{"z": var * int64}'>

Below we have retrieved the value of the 'z' key by treating it as an attribute.

In [36]:
ak_arr.y.z
Out[36]:
<Array [[[4, 5, 6], ... None, [37, 38]]] type='4 * var * option[var * int64]'>

In the next cell, we have further created a few more simple examples of indexing.

In [37]:
ak_arr.y.z[0], ak_arr.y.z[1], ak_arr.y.z[2], ak_arr.y.z[3]
Out[37]:
(<Array [[4, 5, 6], [11, 12]] type='2 * option[var * int64]'>,
 <Array [None, [19]] type='2 * option[var * int64]'>,
 <Array [] type='0 * option[var * int64]'>,
 <Array [[24, 25, 26, 27, ... None, [37, 38]] type='3 * option[var * int64]'>)

Example 2

In this section, we'll explain indexing on our awkward array which we created by loading contents from the GeoJSON file.

In the below cell, we have first retrieved all elements (using ':' to select all values in the first dimension) of our awkward array loaded from GeoJSON, which is a list of dictionaries, and then taken the 'id' key from all dictionaries. This returns an array of all state codes of the United States.

In [38]:
ak_us_states[:, "id"]
Out[38]:
<Array ['AL', 'AK', 'AZ', ... 'WV', 'WI', 'WY'] type='50 * string'>

In the below cell, we have first retrieved all elements of our awkward array, we have then retrieved 'properties' key values for all of them. The values of 'properties' key are again dictionaries hence we have retrieved 'name' key value from them which will return us the name of US states.

In [39]:
x = ak_us_states[:, "properties", "name"]

print(x)
['Alabama', 'Alaska', 'Arizona', ... 'West Virginia', 'Wisconsin', 'Wyoming']

In the next cell, we have retrieved 'type' key values from 'geometry' key values for all dictionaries of our array. This will help us see the geometry type of all states. We can notice that the geometry type for Alabama is Polygon whereas the geometry type for Alaska is MultiPolygon.

In [40]:
x = ak_us_states[:, "geometry", "type"]

print(x)
['Polygon', 'MultiPolygon', 'Polygon', ... 'Polygon', 'Polygon', 'Polygon']

In the below cell, we have retrieved values of 'coordinates' key which is inside of 'geometry' dictionary for all dictionaries of our awkward array.

In [41]:
x = ak_us_states[:, "geometry", "coordinates"]

print(x)
[[[[-87.4, 35], [-85.6, 35], [-85.4, 34.1, ... -111, 44.5], [-111, 45], [-109, 45]]]]

In the below cell, we have retrieved the 0th value of all 'coordinates' keys which are inside of 'geometry' key for all dictionaries of our awkward array.

In [42]:
x = ak_us_states[:, "geometry", "coordinates", 0]

print(x)
[[[-87.4, 35], [-85.6, 35], [-85.4, 34.1], ... [-111, 44.5], [-111, 45], [-109, 45]]]

In the below cell, we have taken the 0th element of our awkward array which is the dictionary. We have then retrieved 'coordinates' key value inside of 'geometry' key of that dictionary. Then we have taken the 0th element of that value. We know that values that are stored inside of 'coordinates' key are multi-dimensional arrays storing Polygon or Multi-Polygon geometry data. For our first dictionary, the geometry is a Polygon of 3-dimensional shape. We have taken the 0th element from this 3-dimensional array which will be another 2-dimensional array as we can see in the output.

In [43]:
x = ak_us_states[0, "geometry", "coordinates", 0]

print(x)
[[-87.4, 35], [-85.6, 35], [-85.4, 34.1], ... -88.1, 34.9], [-88.2, 35], [-87.4, 35]]

In the below cell, we have retrieved the first value of the 3-dimensional array present inside of the value of 'coordinates' key. We have given three times 0 inside of brackets to follow 3 dimensions.

In [44]:
x = ak_us_states[0, "geometry", "coordinates", 0, 0, 0]

print(x)
-87.359296

In the below cell, we have tried to retrieve the 0th element for all 3-dimensional or 4-dimensional arrays present inside of 'coordinates' key. We have given first dimension indexing as ':' to select all dictionaries from our array. We can notice that sometimes our first element is a single float and sometimes, it's an array of two floats. The reason behind this is that for Polygon shapes, the coordinates are represented using a 3-dimensional array and for Multi-Polygon shapes, the coordinates are represented using a 4-dimensional array.

In [45]:
x = ak_us_states[:, "geometry", "coordinates", 0, 0, 0]

print(x)
[-87.4, [-132, 55.1], -109, -94.5, -123, -108, ... [-117, 49], -80.5, -90.4, -109]

In the next two cells, we have explained how we can retrieve information by treating the dictionary key as an attribute of the awkward array.

In [46]:
ak_us_states.geometry.coordinates
Out[46]:
<Array [[[[-87.4, 35], ... [-109, 45]]]] type='50 * var * var * var * union[floa...'>
In [47]:
ak_us_states.geometry.coordinates[0][0][0]
Out[47]:
<Array [-87.4, 35] type='2 * union[float64, var * float64]'>

3. Awkward Array Attributes

In this section, we'll introduce a few useful properties or attributes available for our awkward array object which can be useful when working with them.

We can retrieve the fields/keys of our first dictionary inside of our awkward array using fields attribute.

Below we have printed fields information for both arrays which we created earlier. We can notice that for the first array it only includes fields 'x' and 'y', it does not include key 'z' which is inside of key 'y'. The same is the case with our second awkward array of GeoJSON data.

In [48]:
ak_arr.fields
Out[48]:
['x', 'y']
In [49]:
ak_us_states.fields
Out[49]:
['type', 'id', 'properties', 'geometry']

The nbytes attribute returns the number of bytes required to store an array in memory.

In [50]:
ak_arr.nbytes
Out[50]:
552
In [51]:
ak_us_states.nbytes
Out[51]:
89401

The ndim attribute returns the number of dimensions of our awkward array.

In [52]:
ak_arr.ndim
Out[52]:
2
In [53]:
ak_us_states.ndim
Out[53]:
1

The type attribute as we had discussed earlier returns the type of our awkward array.

In [54]:
ak_arr.type
Out[54]:
4 * var * {"x": option[var * int64], "y": ?{"z": var * int64}}
In [55]:
ak_us_states.type
Out[55]:
50 * {"type": string, "id": string, "properties": {"name": string, "population": int64}, "geometry": {"type": string, "coordinates": var * var * var * union[float64, var * float64]}}

In the next cell, we have explained that we can treat keys of our first dictionary inside of awkward array as attributes of an array.

In [56]:
ak_arr.x
Out[56]:
<Array [[[1, 2, 3], ... 34, 35, 36], None]] type='4 * var * option[var * int64]'>
In [57]:
ak_arr.y
Out[57]:
<Array [[{z: [4, 5, 6]}, ... {z: [37, 38]}]] type='4 * var * ?{"z": var * int64}'>
In [58]:
ak_us_states.geometry
Out[58]:
<Array [{type: 'Polygon', ... ] type='50 * {"type": string, "coordinates": var *...'>
In [59]:
ak_us_states.id
Out[59]:
<Array ['AL', 'AK', 'AZ', ... 'WV', 'WI', 'WY'] type='50 * string'>

4. Normal Array Operations

In this section, we'll explain commonly performed array operations like filtering entries, flattening arrays, combining arrays, checking for nulls, counting non-zero elements, conditional operations, etc. We'll be using our arrays which we created earlier to explain various functions in this section.

Filtering Arrays

Awkward array lets us filter our arrays based on conditions, just like we filter rows of a pandas data frame.

Below we have created a condition that takes all entries of our array and then takes 'x' key for each entry. It then takes the 0th element of all values of key 'x' and compares it with value 20. It returns True if the value is greater than or equal to 20.

In the next cell, we have filtered our main array based on the condition that we created below. We have also printed the result of filtering our main array.

In [60]:
ak_arr[:, "x", :, 0] >= 20
Out[60]:
<Array [[False, False], ... True, True, None]] type='4 * var * ?bool'>
In [61]:
x = ak_arr[ak_arr[:, "x", :, 0] >= 20]

print(x)

print(x["x"])
[[], [None], ... 27, 28]}}, {x: [29, 30, 31, 32, 33, 34, 35, 36], y: None}, None]]
[[], [None], [], [[20, 21, 22, 23], [29, 30, 31, 32, 33, 34, 35, 36], None]]

In the below cell, we have created another condition where we check for the last element of the value of key 'z' of our awkward array. We then filter our main array based on the result of this condition.

In [62]:
ak_arr[:, "y", "z", :, -1] >= 20
Out[62]:
<Array [[False, False], ... True, None, True]] type='4 * var * ?bool'>
In [70]:
x = ak_arr[ak_arr[:, "y", "z", :, -1] >= 20]

print(x)

print(x["x"])
[[], [None], ... y: {z: [24, 25, 26, 27, 28]}}, None, {x: None, y: {z: [37, 38]}}]]
[[], [None], [], [[20, 21, 22, 23], None, None]]

Below we have explained another example of filtering an awkward array. This time we have filtered our GeoJSON array. We have created a condition that checks for the Polygon geometry type of each entry of our array. We filter our main awkward array to keep only entries where the geometry type is Polygon. We have then also printed the count of entries which has Polygon geometry and count of total elements of our array for comparison. We can notice from the results that there are 43 Polygon geometries from our total 50 geometries. We have also printed state IDs to compare which state has Polygon geometry and which has Multi-Polygon geometry.

In [89]:
x = ak_us_states[ak_us_states["geometry", "type"] == "Polygon"]

print("Number of States with Polygon Geometry : {}".format(len(x["id"])))
print("Number of States with Polygon and MultiPolygon Geometry : {}".format(len(ak_us_states["id"])))
print()
print(x["id"])
print(ak_us_states["id"])
Number of States with Polygon Geometry : 43
Number of States with Polygon and MultiPolygon Geometry : 50

['AL', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', ... 'TX', 'UT', 'VT', 'WV', 'WI', 'WY']
['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', ... 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

In the below cell, we have divided entries of our GeoJSON awkward array into two categories based on conditions. The first condition checks for entries where the population is greater than 1M and the second condition checks for entries where the population is less than or equal to 1M. We have then printed the count of states in each category.

In [91]:
x = ak_us_states[ak_us_states["properties", "population"] > 1e6]
y = ak_us_states[ak_us_states["properties", "population"] <= 1e6]

print("Number of States with Population Greater Than 1M : {}".format(len(x["id"])))
print("Number of States with Population Less    Than 1M : {}".format(len(y["id"])))
print()
print(x["properties", "population"])
print(y["properties", "population"])
Number of States with Population Greater Than 1M : 44
Number of States with Population Less    Than 1M : 6

[4888949, 7123898, 3020327, 39776830, ... 8525660, 7530552, 1803077, 5818049]
[738068, 971180, 755238, 877790, 623960, 573720]

In the cell below, we again filter our main array to keep only entries whose geometry is Polygon. We then collect the longitudes and latitudes of all points into separate arrays. GeoJSON stores each coordinate pair as [longitude, latitude], so index 0 is the longitude and index 1 is the latitude.

In [124]:
x = ak_us_states[ak_us_states["geometry", "type"] == "Polygon"]
longitudes = x["geometry", "coordinates", :, :, :, 0]
latitudes = x["geometry", "coordinates", :, :, :, 1]

print("Polygon State Latitudes : {}".format(latitudes))
print("Polygon State Longitudes : {}".format(longitudes))
Polygon State Latitudes : [[[35, 35, 34.1, 32.9, 32.6, 32.4, 32.3, 32.3, ... 41, 41, 41, 42, 44.5, 45, 45]]]
Polygon State Longitudes : [[[-87.4, -85.6, -85.4, -85.2, -85.1, -85, ... -109, -111, -111, -111, -111, -109]]]

Copying Array

We can create a copy of any awkward array by passing it to the ak.copy() function.

In [190]:
ak_arr2 = ak.copy(ak_arr)

ak_arr2
Out[190]:
<Array [[{x: [1, 2, 3], ... z: [37, 38]}}]] type='4 * var * {"x": option[var * i...'>

Counting Elements

We can count the elements of an array using the count() function, which counts the values that are not None. To count only the non-zero entries, use the count_nonzero() function.

In [82]:
ak.count(ak_arr)
Out[82]:
38
In [81]:
ak.count(ak_arr["x"]), ak.count(ak_arr["y"]), ak.count(ak_arr["y"]["z"])
Out[81]:
(25, 13, 13)
In [88]:
ak.count(ak_us_states), ak.count(ak_us_states["properties"]["population"])
Out[88]:
(8341, 50)
In [77]:
ak.count_nonzero(ak_arr["x"])
Out[77]:
25

Flattening Array

We noticed in many of our examples that whenever we retrieve arrays from our main array using indexing or conditions, the result usually keeps the structure of the main array. This can create arrays with multiple levels of nesting. We can flatten such arrays with the flatten() function, which by default removes one level of nesting per call. We have explained below how we can use it.

In [91]:
x = ak.flatten(ak_arr["x"])

print(x)

x = ak.flatten(x)

print(x)
[[1, 2, 3], [7, 8, 9, 10], [13, 14, ... 23], [29, 30, 31, 32, 33, 34, 35, 36], None]
[1, 2, 3, 7, 8, 9, 10, 13, 14, 15, 16, ... 22, 23, 29, 30, 31, 32, 33, 34, 35, 36]

The ravel() function can also be used to flatten an array. Unlike a default flatten() call, which removes one level of nesting, ravel() removes all levels at once.

In [177]:
x = ak.ravel(ak_arr["x"])

print(x)
[1, 2, 3, 7, 8, 9, 10, 13, 14, 15, 16, ... 22, 23, 29, 30, 31, 32, 33, 34, 35, 36]
In [249]:
ak.ravel(ak_arr[:, "x", :, 0])
Out[249]:
<Array [1, 7, 13, 20, 29] type='5 * int64'>

Concatenate Arrays

We can concatenate more than one awkward array using concatenate() function. Below we have explained with simple examples how we can concatenate arrays.

In [63]:
print(ak_arr["x"])
print(ak_arr["y"])
print(ak_arr["y"]["z"])
[[[1, 2, 3], [7, 8, 9, 10]], ... 23], [29, 30, 31, 32, 33, 34, 35, 36], None]]
[[{z: [4, 5, 6]}, {z: [11, 12]}], ... 24, 25, 26, 27, 28]}, None, {z: [37, 38]}]]
[[[4, 5, 6], [11, 12]], [None, [19]], [], [[24, 25, 26, 27, 28], None, [37, 38]]]
In [179]:
x = ak.concatenate((ak_arr["x"], ak_arr["y"]))

print(x)
[[[1, 2, 3], [7, 8, 9, 10]], [[13, ... 24, 25, 26, 27, 28]}, None, {z: [37, 38]}]]
In [180]:
x = ak.concatenate((ak_arr["x"], ak_arr["y"]["z"]))

print(x)
[[[1, 2, 3], [7, 8, 9, 10]], [[13, ... [], [[24, 25, 26, 27, 28], None, [37, 38]]]

Selecting First Elements

We can retrieve the first element of each sub-list of our awkward array using the firsts() function. Empty sub-lists yield None. Below we print the original array and then the first elements for comparison.

In [207]:
x = ak.firsts(ak_arr)

print("All Elements : ")
for elem in ak_arr:
    print(elem)

print("\nFirst Elements")
for elem in x:
    print(elem)
All Elements :
[{x: [1, 2, 3], y: {z: [4, 5, 6]}}, {x: [7, 8, 9, 10], y: {z: [11, 12]}}]
[{x: [13, 14, 15, 16, 17, 18], y: None}, {x: None, y: {z: [19]}}]
[]
[{x: [20, 21, 22, 23], y: {z: [24, 25, 26, ... y: None}, {x: None, y: {z: [37, 38]}}]

First Elements
{x: [1, 2, 3], y: {z: [4, 5, 6]}}
{x: [13, 14, 15, 16, 17, 18], y: None}
None
{x: [20, 21, 22, 23], y: {z: [24, 25, 26, 27, 28]}}

Checking for Null Elements

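We can check for missing (None) entries in our awkward array using the is_none() function. By default it checks the outermost level of the array. The axis argument lets us check a deeper level instead.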

In [208]:
ak.is_none(ak_arr)
Out[208]:
<Array [False, False, False, False] type='4 * bool'>
In [334]:
print(ak_arr["x"])
print(ak.is_none(ak_arr["x"], axis=1))
[[[1, 2, 3], [7, 8, 9, 10]], ... 23], [29, 30, 31, 32, 33, 34, 35, 36], None]]
[[False, False], [False, True], [], [False, False, True]]
In [281]:
print(ak_arr["x"])
print(ak_arr["y", "z"])
[[[1, 2, 3], [7, 8, 9, 10]], ... 23], [29, 30, 31, 32, 33, 34, 35, 36], None]]
[[[4, 5, 6], [11, 12]], [None, [19]], [], [[24, 25, 26, 27, 28], None, [37, 38]]]

Zipping Array Elements

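The zip() function works much like Python's built-in zip(): it combines the corresponding elements of several arrays into tuples, or into records with named fields when given a dictionary.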

In [289]:
print("First Array  : ", ak_arr["x", 0,0])
print("Second Array : ", ak_arr["y", "z", 0,0])
print("Zipped Array : ", ak.zip((ak_arr["x", 0, 0], ak_arr["y", "z", 0, 0])))
First Array  :  [1, 2, 3]
Second Array :  [4, 5, 6]
Zipped Array :  [(1, 4), (2, 5), (3, 6)]

Conditional Operations on Array

We can also perform conditional operations on an array using the where() function, just like we do with numpy arrays. Below we have created three arrays that have the same structure as our first array but with every element replaced by one, zero, and 100 respectively. We'll be using these arrays to explain the where() function.

In [294]:
ak_arr_ones = ak.ones_like(ak_arr)
ak_arr_zeros = ak.zeros_like(ak_arr)
ak_arr_hundred = ak.full_like(ak_arr, 100)

Below we have created a simple condition which checks for each entry inside of key 'x' of our array and returns True if the value is greater than or equal to 20 else returns False. We'll be using this condition inside of where() function. We have also printed the output of the condition to make comparison easy.

In [308]:
condition = ak_arr["x"] >= 20

for elem in condition:
    print(elem)
[[False, False, False], [False, False, False, False]]
[[False, False, False, False, False, False], None]
[]
[[True, True, True, True], ... True, True, True, True, True, True, True, True], None]

Below we have given the condition which we created in the previous cell as the first input to our where() function followed by 'x' values of the original array and 'x' values of zeros array.

We have then printed the elements of the resulting array. Entries greater than or equal to 20 are kept, and all other entries are set to zero.

In [309]:
x = ak.where(condition, ak_arr["x"], ak_arr_zeros["x"])

for elem in x:
    print(elem)
[[{x: [0, 0, 0], y: {z: [0, 0, 0]}}, {x: [, ... {x: [0, 0, 0, 0], y: {z: [0, 0]}}]]
[[{x: [0, 0, 0, 0, 0, 0], y: None}, {x: [0, 0, 0, 0, 0, 0], y: None}, ... None]
[]
[[{x: [20, 21, 22, 23], y: {z: [24, 25, ... 31, 32, 33, 34, 35, 36], y: None}], None]

Below we have created another condition that checks whether the values under the 'z' key are greater than or equal to 10.

Then in the following cell, we have replaced all entries that are less than 10 with 100.

In [310]:
condition = ak_arr[:, "y", "z"] >= 10

for elem in condition:
    print(elem)
[[False, False, False], [True, True]]
[None, [True]]
[]
[[True, True, True, True, True], None, [True, True]]
In [311]:
x = ak.where(condition, ak_arr[:,"y","z"], ak_arr_hundred[:,"y","z"])

for elem in x:
    print(elem)
[[{x: [100, 100, 100], y: {z: [100, 100, 100]}, ... 7, 8, 9, 10], y: {z: [11, 12]}}]]
[None, [{x: None, y: {z: [19]}}]]
[]
[[{x: [20, 21, 22, 23], y: {z: [24, 25, 26, 27, ... {x: None, y: {z: [37, 38]}}]]

5. Simple Statistics

In this section, we'll explain how to compute simple statistics like the minimum, maximum, mean, sum, standard deviation, and variance of our awkward array entries.

Minimum

We can retrieve the minimum element of an awkward array, just like in numpy, using the min() function. We can also retrieve the minimum elements along a particular axis by providing an axis value.

Below we first display the array, then its overall minimum, followed by the minimum of each sub-list (axis=1).

In [353]:
print("Original Array            : ", ak_arr[:, "x",:,  0])

print("Minimum Element           : ", ak.min(ak_arr[:, "x",:,  0]))
print("Minimum Elements (Axis=1) : ", ak.min(ak_arr[:, "x",:,  0], axis=1))
Original Array            :  [[1, 7], [13, None], [], [20, 29, None]]
Minimum Element           :  1
Minimum Elements (Axis=1) :  [1, 13, None, 20]

We can also retrieve an index of the minimum element just like numpy using argmin() function.

In [379]:
print("Original Array                  : ", ak_arr[:, "x",:,  0])

print("Minimum Element Index           : ", ak.argmin(ak_arr[:, "x",:,  0]))
print("Minimum Elements Index (Axis=1) : ", ak.argmin(ak_arr[:, "x",:,  0], axis=1))
Original Array                  :  [[1, 7], [13, None], [], [20, 29, None]]
Minimum Element Index           :  0
Minimum Elements Index (Axis=1) :  [0, 0, None, 0]

Maximum

Just like the minimum, we can retrieve maximum elements using the max() function and the index of maximum elements using the argmax() function.

In [354]:
print("Original Array            : ", ak_arr[:, "x",:,  0])

print("Maximum Element           : ", ak.max(ak_arr[:, "x",:,  0]))
print("Maximum Elements (Axis=1) : ", ak.max(ak_arr[:, "x",:,  0], axis=1))
Original Array            :  [[1, 7], [13, None], [], [20, 29, None]]
Maximum Element           :  29
Maximum Elements (Axis=1) :  [7, 13, None, 29]
In [380]:
print("Original Array                  : ", ak_arr[:, "x",:,  0])

print("Maximum Element Index           : ", ak.argmax(ak_arr[:, "x",:,  0]))
print("Maximum Elements Index (Axis=1) : ", ak.argmax(ak_arr[:, "x",:,  0], axis=1))
Original Array                  :  [[1, 7], [13, None], [], [20, 29, None]]
Maximum Element Index           :  4
Maximum Elements Index (Axis=1) :  [1, 0, None, 1]

Average

We can calculate the mean of an awkward array using mean() function.

In [368]:
print("Original Array : ", ak_arr[:, "x",:,  0])

print("Mean           : ", ak.mean(ak_arr[:, "x",:,  0]))
print("Mean (Axis=1)  : ", ak.mean(ak_arr[:, "x",:,  0], axis=1))
Original Array :  [[1, 7], [13, None], [], [20, 29, None]]
Mean           :  14.0
Mean (Axis=1)  :  [4, 13, None, 24.5]

Sum

We can add elements of an array using sum() function. Just like other functions, we can perform addition at a particular axis as well.

In [369]:
print("Original Array : ", ak_arr[:, "x",:,  0])

print("Sum           : ", ak.sum(ak_arr[:, "x",:,  0]))
print("Sum (Axis=1)  : ", ak.sum(ak_arr[:, "x",:,  0], axis=1))
Original Array :  [[1, 7], [13, None], [], [20, 29, None]]
Sum           :  70
Sum (Axis=1)  :  [8, 13, 0, 49]

Standard Deviation

The std() function can be used to calculate standard deviation.

In [381]:
print("Original Array               : ", ak_arr[:, "x",:,  0])

print("Standard Deviation           : ", ak.std(ak_arr[:, "x",:,  0]))
print("Standard Deviation (Axis=1)  : ", ak.std(ak_arr[:, "x",:,  0], axis=1))
Original Array               :  [[1, 7], [13, None], [], [20, 29, None]]
Standard Deviation           :  9.797958971132712
Standard Deviation (Axis=1)  :  [3, 0, None, 4.5]

Variance

The var() function can be used to calculate the variance of the array.

In [382]:
print("Original Array     : ", ak_arr[:, "x",:,  0])

print("Variance           : ", ak.var(ak_arr[:, "x",:,  0]))
print("Variance (Axis=1)  : ", ak.var(ak_arr[:, "x",:,  0], axis=1))
Original Array     :  [[1, 7], [13, None], [], [20, 29, None]]
Variance           :  96.0
Variance (Axis=1)  :  [9, 0, None, 20.2]
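
The overall values printed above (mean 14.0, variance 96.0, standard deviation 9.79...) correspond to the population formulas, which divide by N rather than N - 1. This matches numpy's default of ddof=0. The plain-Python check below redoes the arithmetic by hand on the five non-None values. The variable names are only illustrative.

```python
import math

values = [1, 7, 13, 20, 29]  # the non-None values of the array printed above
n = len(values)

mean = sum(values) / n
squared_deviations = [(v - mean) ** 2 for v in values]

variance = sum(squared_deviations) / n               # population variance (ddof=0)
sample_variance = sum(squared_deviations) / (n - 1)  # sample variance (ddof=1)
std = math.sqrt(variance)

print(mean, variance, sample_variance, std)
```

The sample variance comes out as 120.0, so comparing against 96.0 is a quick way to tell which convention a library uses.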

Sorting

We can sort the elements of an array using the sort() function. We can sort elements in descending order by setting the ascending argument to False.

In [387]:
print("Original Array          : ", ak_arr[:, "x",:,  0])

print("Sorted Array Descending : ", ak.sort(ak_arr[:, "x",:,  0], ascending=False))
Original Array          :  [[1, 7], [13, None], [], [20, 29, None]]
Sorted Array Descending :  [[7, 1], [13, None], [], [29, 20, None]]

This ends our small tutorial explaining how to create awkward arrays and work with tree-like data structures using numpy-like idioms. Please feel free to let us know your views in the comments section.

Sunny Solanki
