## Exploring leading and trailing zeros, the distribution of letters and numbers, common prefixes, regular expressions, and randomization of the data set

According to the ICAO standard, a passport number must be up to 9 characters long and can contain numbers and letters. During your work as an analyst, you may come across a data set containing passport numbers and be asked to explore it.

I’ve recently worked with one such set and I’d like to share the steps of this analysis with you, including:

- Number of records
- Duplicated records
- Length of the records
- Analysis of the leading and trailing zeros
- Appearance of a character at a specific position
- Where letters appear in the string, using regular expressions (regexes)
- Length of the sequence of letters
- Is there a common prefix
- Anonymize/randomize the data while keeping the characteristics of the dataset

You can go through the steps with me. Get the (randomized) data from GitHub. It also contains the Jupyter notebook with all the steps.

## Basic Dataset Exploration

First, let’s load the data. Since the dataset contains just one column, it’s quite simple.

```python
# import the packages that will be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"pathdata.csv")
df.info()
```

The `.info()` command tells us that we have 10902 passports in the dataset and that all of them were imported as “object”, which means the format is `string`.

## Duplicates

The initial step of any analysis should be a check for duplicated values. In our case there are some, so we will remove them using pandas’s `.drop_duplicates()` method.

```python
print(len(df["PassportNumber"].unique()))  # if less than 10902, there are duplicates
df.drop_duplicates(inplace=True)  # or df = df.drop_duplicates()
```

**Length of the passports**

Usually, you continue with a check of the longest and the shortest passport.

```
[In]: df["PassportNumber"].agg(["min", "max"])
[Out]:
min    000000050
max    ZXD244549
Name: PassportNumber, dtype: object
```

You might conclude that all the passports are 9 characters long, but you would be misled. The data have string format, so the lowest “string” value is the one starting with the highest number of zeros, and the largest is the one with the most Zs at the beginning.

```
# ordering of strings is not the same as ordering of numbers
"0" < "0001" < "001" < "1" < "123" < "AB" < "Z" < "Z123" < "ZZ123"
```
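To convince yourself of this ordering, you can sort the example values as strings; a small self-contained sketch:

```python
# string (lexicographic) order is not numeric order
values = ["1", "123", "0", "Z", "AB", "001", "0001", "Z123", "ZZ123"]
print(sorted(values))
# → ['0', '0001', '001', '1', '123', 'AB', 'Z', 'Z123', 'ZZ123']
```

The comparison runs character by character, which is why `"0001"` sorts before `"001"` and every digit sorts before every letter.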

To see the actual lengths of the passports, let’s measure them directly.

```
[In]: df["PassportNumber"].apply(len).agg(["min", "max"])
[Out]:
min     3
max    17
Name: PassportNumber, dtype: object
```

In contrast to our initial belief, the shortest passport contains only 3 characters while the longest is 17 (far more than the expected maximum of 9) characters long.

Let’s extend our data frame with a `'len'` column so that we can look at examples:

```python
# Add the 'len' column
df['len'] = df["PassportNumber"].apply(len)
```

```
# look at the examples having the maximum length
[In]: df[df["len"]==df['len'].max()]
[Out]:
   PassportNumber  len
25109300000000000   17
27006100000000000   17

# look at the examples having the minimum length
[In]: df[df["len"]==df['len'].min()]
[Out]:
PassportNumber  len
           179    3
           917    3
           237    3
```

The 3-digit passport numbers look suspicious, but they meet the ICAO criteria. The longest ones are obviously too long; however, they contain quite a few trailing zeros. Maybe somebody just added the zeros in order to satisfy some data storage requirement.
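We can test the padding hypothesis on the 17-character example from the output above: stripping the trailing zeros leaves a much shorter core.

```python
padded = "25109300000000000"   # 17-character example from the dataset
core = padded.rstrip("0")      # remove the trailing zero padding
print(core, len(core))         # → 251093 6
```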

Let’s take a look at the overall length distribution of our data sample.

```python
# calculate the count of occurrences of the various lengths
counts_by_value = df["len"].value_counts().reset_index()
separator = pd.Series(["|"]*df["len"].value_counts().shape[0])
separator.name = "|"
counts_by_index = df["len"].value_counts().sort_index().reset_index()
length_distribution_df = pd.concat([counts_by_value, separator, counts_by_index], axis=1)

# draw the chart
ax = df["len"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("length")
ax.set_ylabel("number of records")
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.05))
```

We see that most passport numbers in our sample are 7, 8 or 9 characters long. Quite a few are, however, 10 or 12 characters long, which is unexpected.

## Leading and trailing zeros

Maybe the long passports have leading or trailing zeros, like our example with 17 characters.

In order to explore these zero-pads, let’s add two more columns to our data set, ‘leading_zeros’ and ‘trailing_zeros’, containing the number of leading and trailing zeros.

```python
# the number of leading zeros can be calculated by subtracting the length of the
# string l-stripped of the leading zeros from the total length of the string
df["leading_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.lstrip("0")))

# similarly, the number of trailing zeros can be calculated by subtracting the
# length of the r-stripped string from the total length of the string
df["trailing_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.rstrip("0")))
```
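A quick sanity check of this subtraction trick on a single hand-made value:

```python
p = "0012300"
leading = len(p) - len(p.lstrip("0"))   # 7 - 5 = 2
trailing = len(p) - len(p.rstrip("0"))  # 7 - 5 = 2
print(leading, trailing)                # → 2 2
```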

Then we can simply display the passports which have more than 9 characters to check whether they have any leading or trailing zeros:

```
[In]: df[df["len"]>9]
[Out]:
PassportNumber  len  leading_zeros  trailing_zeros
   73846290957   11              0               0
    N614226700   10              0               2
    WC76717593   10              0               0
...
```

## Distribution of zeros at specific positions

Most of the passports in the set don’t have any zeros and still they are longer than 9 characters. Just for the sake of demonstrating it, let’s look at the distribution of zeros at each position of the passport numbers.

We know that the shortest passports fill 3 positions and the longest 17. Let’s iterate through all the passport numbers, and then through all their characters, to see where the zeros are.

```python
passports_list = []

# for each passport number
for passport_number in df["PassportNumber"]:
    # create a dictionary with the passport number
    pos = {"PassportNumber": passport_number}
    # for each position check if it is "0" -> True, or something else -> False
    for i, c in enumerate(passport_number):
        pos[i+1] = True if c == "0" else False
    passports_list.append(pos)

# turn the list of dictionaries into a pandas DataFrame
zeros_distribution = pd.DataFrame(passports_list)
zeros_distribution["len"] = zeros_distribution["PassportNumber"].apply(len)
```

The output of this operation is the new data frame `'zeros_distribution'`, which contains **True** at each position where there is a zero in the passport number.

Notice that I’ve highlighted the `True` values with a yellow background. Styling in pandas can be done using the `.style` method, but you have to be careful. Styling will render the whole dataset, which can be time-consuming, so specify how many rows you want to display, e.g. via `.sample(5)`. You can apply the style to some columns only using the `subset` parameter.

```python
# styling function
def highlight_true(val):
    return 'background-color: yellow' if val == 1 else ''

# apply the style to each value using applymap,
# with subset for position columns 1-17
zeros_distribution.sample(5).style.applymap(highlight_true, subset=list(range(1, 18)))
```

By **summing** the zeros at each position we can see how many zeros appear there. By **counting** all values at a position we see how many times this position is filled in in the dataset (e.g. position 17 is rarely filled in), because count(N/A) is 0. We run these aggregations only over columns 1–17 via `range(1,18)`.

```python
# sum the zeros at each position
zero_count = zeros_distribution[range(1, 18)].sum()
zero_count.name = "is_zero"

# count how many times each position is filled in
position_filled_count = zeros_distribution[range(1, 18)].count()
position_filled_count.name = "is_filled"
```

We see that, except for positions 16 and 17, the values are not trailing zeros. It’s of course possible that these values are combinations of both leading and trailing zeros, but that’s not very probable. We have also already seen examples of passport numbers longer than 9 characters before.

We can simply `.strip("0")` to remove both leading and trailing zeros, and we will see that there are still invalid passport numbers longer than 9 characters:

```python
df["PassportNumber"].str.strip("0").apply(len).value_counts().sort_index().plot(kind="bar")
```
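The same check can be illustrated on a small sample (the first two values reuse the long examples shown earlier; the other two are made up): after stripping zeros on both sides, any string still longer than 9 characters cannot be explained by zero padding alone.

```python
import pandas as pd

sample = pd.Series(["000ABC123000", "73846290957", "N614226700", "AB1234567"])
stripped_len = sample.str.strip("0").apply(len)
print(stripped_len.tolist())   # → [6, 11, 8, 9]
# "73846290957" stays 11 characters long even without zeros -> invalid
```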

## Analyze strings with common expressions

To review string patterns, it’s helpful to use the power of regular expressions. Even though their syntax is a little cumbersome, once used correctly they can be very quick and efficient.

For a start, let’s add a column which is True in case the passport number starts with a letter and False otherwise.

```python
import re

df["starts_with_letter"] = df["PassportNumber"].apply(
    lambda x: True if re.match("^[a-zA-Z]+.*", x) else False)
```

The regular expression `^[a-zA-Z]+.*` means:

- `^` at the beginning
- `[a-zA-Z]` a lower case or capital letter
- `+` one or more times
- `.` followed by any character
- `*` zero or more times
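A few examples of how this pattern behaves (`re.match` returns a match object, which is truthy, or `None` when there is no match):

```python
import re

pattern = "^[a-zA-Z]+.*"
print(bool(re.match(pattern, "AB123")))   # → True  (starts with letters)
print(bool(re.match(pattern, "Z")))       # → True  (a single letter matches too)
print(bool(re.match(pattern, "123AB")))   # → False (starts with a digit)
```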

I like to display the results not just as a Series, but as a DataFrame, because I usually want to show more information, for example the **count** and the **percentage**, as in the case of passports starting with a letter.

```python
start_with_letter_count = df["starts_with_letter"].value_counts()
pd.DataFrame({"count": start_with_letter_count,
              "percentage": start_with_letter_count/df.shape[0]}
            ).style.format("{:.2%}", subset=["percentage"])
```

One might also be interested whether there are letters only at the beginning, or whether they also appear in the middle of the passport number (after at least one digit). This can again be easily solved with a regular expression:

- `^` at the beginning
- `.` anything (a letter, a digit or another character)
- `*` zero or more times
- `\d` then a digit
- `+` one or more times
- `[a-zA-Z]` then a lower case or capital letter
- `+` one or more times
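A few examples of this second pattern on made-up values:

```python
import re

pattern = r"^.*\d+[a-zA-Z]+"
print(bool(re.match(pattern, "AB12CD34")))  # → True  (letters appear after a digit)
print(bool(re.match(pattern, "AB123456")))  # → False (only trailing digits)
print(bool(re.match(pattern, "1A")))        # → True
```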

```python
df["letter_after_number"] = df["PassportNumber"].apply(
    lambda x: "letter after number" if re.match(r"^.*\d+[a-zA-Z]+", x) else "trailing numbers only")
```

If we display the count and the percentage using the same code as above, we will see that most of the passports have no letter in the middle.

Another question, when we were designing the passport pattern for our system, was how many letters can appear at the beginning of the string. To find out, let’s devise a simple function using a regex.

```python
def length_of_start_letter_sequence(string):
    # re.match returns None when no match is found,
    # and calling .group(0) on None would raise an error
    if re.match("^[a-zA-Z]+", string) is not None:
        return len(re.match("^[a-zA-Z]+", string).group(0))
    else:
        return 0
```

`^[a-zA-Z]+` means that there are one or more letters at the beginning. The `.group(0)` method of the `re.match` result returns the first group which matches this regular expression. So in the case of `ABC123` it would return `ABC`, and we can simply count its length. The only catch is that `re.match` returns `None`, and `group()` fails when the pattern isn’t met, so we have to handle the cases when the passport contains only digits.
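A minimal illustration of both cases:

```python
import re

m = re.match("^[a-zA-Z]+", "ABC123")
print(m.group(0), len(m.group(0)))       # → ABC 3

# only digits -> no match, so guard against None before calling .group(0)
print(re.match("^[a-zA-Z]+", "123456"))  # → None
```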

Once we have our function, it’s just a matter of `apply`ing it to the PassportNumber column:

```python
df["length_of_start_letters"] = df["PassportNumber"].apply(length_of_start_letter_sequence)
```

In this case, let’s not show the statistics, but list the values sorted by the length of the letter sequence at the beginning:

```python
df.sort_values(by="length_of_start_letters", ascending=False).loc[
    df["length_of_start_letters"]>0, ["PassportNumber", "length_of_start_letters"]]
```

You are sometimes asked to provide example values, and that is easily done with the `sample()` method.

```
# five examples of the passports which start with 3 letters
[In]: df[df["length_of_start_letters"]==3]["PassportNumber"].sample(5).to_list()
[Out]: ['DZO795085', 'SNJ370118', 'UJR13307234', 'DSG229101', 'VAA353972']
```

## Common prefix

There’s a chance that the data provider has added a prefix to the values which was not originally part of the passport number. For most purposes, such a prefix should be removed. But how to find it?

Let’s first assume that our prefix has 3 characters. Then we can simply `slice` the first 3 characters and `value_counts` them to see which ones are the most common:

```
[In]: df["PassportNumber"].str.slice(stop=3).value_counts()
[Out]:
000    41
009    37
005    33
```

We can see that some of the values appear much more often than the average occurrence of a 3-character prefix (`.mean()` applied to the code above gives 2.3). 009 appears 37 times, which is much more often than 2.3 times.
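The baseline logic can be shown on a toy series (hypothetical values, not the actual dataset): the mean of the `value_counts()` is how often each prefix would appear if all prefixes were equally common.

```python
import pandas as pd

prefixes = pd.Series(["009", "009", "009", "123", "456", "789"])
counts = prefixes.value_counts()
print(counts.mean())    # → 1.5  (6 values / 4 distinct prefixes)
# "009" appears 3 times, i.e. twice the average -> a candidate prefix
```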

You can expand your analysis and check whether the prefix appears only for certain lengths. We can assume that passports having 12 characters have at least a 3-letter prefix. The following code will reveal that the prefix `932` is much more common for 12-character-long passports than usual.

```python
c = df[["PassportNumber", "len"]]
c["prefix"] = c["PassportNumber"].str.slice(stop=3)

# group by both the prefix and the total length of the passport
prefix_by_length_df = c.groupby(["prefix", "len"]).size().reset_index()
prefix_by_length_df[prefix_by_length_df["len"]==12].sort_values(by=0, ascending=False)
```

The value `932` appears 27 times, while the average for 12-character-long passports is only 1.3. You can rerun these simple size-counts for shorter or longer prefixes, or limit them to other characteristics.

## Randomizing the passports

If you have sensitive data but need to share them with a data scientist, most clients would agree to share anonymized data or a randomized sample. In our case, let’s try a simple randomization which:

- Keeps leading and trailing zeros
- Changes any digit to a random digit
- Changes any letter to a random letter

```python
import random
import string

def passport_randomizer(list_of_passports):
    processed = {}  # dictionary which will keep the {"old": "new value"} mapping
    # loop through all the values
    for p in list_of_passports:
        leading_zeros_up_to = len(p) - len(p.lstrip("0"))
        trailing_zeros_up_to = len(p) - (len(p) - len(p.rstrip("0")))
        out = []
        for i, c in enumerate(p):
            # keep the leading and trailing zeros intact
            if i < leading_zeros_up_to or i >= trailing_zeros_up_to:
                out.append(c)
            # then change any digit to a random digit
            elif c.isnumeric():
                out.append(str(random.randint(0, 9)))
            # finally change the rest to random letters
            else:
                out.append(random.choice(string.ascii_letters).upper())
        processed[p] = "".join(out)
    return processed
```

Such a function goes character by character and changes the values to new random ones. The result is a dictionary `{"old value": "new value"}`:

```python
{'0012300': '0050100',
 'ABC': 'LNZ',
 '00EFG': '00AQT',
 'IJK00': 'KVP00',
 '012DF340': '032DT030'}
```

This approach keeps the leading/trailing zeros for an analysis of the randomized data; however, it would reset any repeating prefix. Such an approach would be more complex and you can try it at home.

Having the mapping of the old to new values, we can simply `.map()` it onto the list of passports to get a new randomized list which keeps most of the characteristics of our original set.

```python
df["PassportNumber"].map(
    passport_randomizer(df["PassportNumber"].unique())
).to_csv(r"new_data.csv", index=False, header=True)
```

The function is applied to the `.unique()` passports to avoid the same duplicated passport being turned into two different new random values. It’s still possible that two different passports would become the same. To avoid that risk, you would have to check, during the creation of each new string, whether the new random value isn’t already used in the current mapping.
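One possible way to rule out such collisions is to retry until a free value is found. This is only a sketch: the helper `randomize_one` is a hypothetical re-implementation of the character-by-character logic above, and the retry loop assumes there are always enough unused values left.

```python
import random
import string

def randomize_one(p):
    # keep leading/trailing zeros, randomize digits and letters in between
    lead = len(p) - len(p.lstrip("0"))
    trail_from = len(p.rstrip("0"))
    out = []
    for i, c in enumerate(p):
        if i < lead or i >= trail_from:
            out.append(c)
        elif c.isnumeric():
            out.append(str(random.randint(0, 9)))
        else:
            out.append(random.choice(string.ascii_uppercase))
    return "".join(out)

def passport_randomizer_no_collisions(passports):
    processed, used = {}, set()
    for p in set(passports):
        new = randomize_one(p)
        # retry until the generated value has not been used yet
        while new in used:
            new = randomize_one(p)
        processed[p] = new
        used.add(new)
    return processed
```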

In this article, we have reviewed how to analyze a list of alphanumeric strings to reveal their patterns. The techniques we used are among the most common requirements in any **data analysis** or **data cleaning** task, and they can be tweaked to cover a huge range of similar jobs like:

- counting the length of the strings
- checking the leading and trailing zeros
- reviewing the occurrence of a specific character (zero) at a specific position
- reviewing how many values start with a letter
- reviewing how many have a letter in the middle
- analyzing whether there is a common prefix
- randomizing the dataset

Feel free to play with the data yourself: