Exploring the leading and trailing zeros, distribution of letters and numbers, common prefixes, regular expressions, and randomization of the data set.

Photo by mana5280 on Unsplash

According to the ICAO standard, a passport number should be up to 9 characters long and can contain numbers and letters. During your work as an analyst, you may come across a data set containing passport numbers, and you'll be asked to explore it.

I’ve recently worked with one such set and I’d like to share the steps of this analysis with you, including:

  • Number of records
  • Duplicated records
  • Length of the records
  • Analysis of the leading and trailing zeros
  • Appearance of a character at a specific position
  • Where letters appear in the string, using regular expressions (regexes)
  • Length of the sequence of letters
  • Is there a common prefix
  • Anonymize/randomize the data while preserving the characteristics of the dataset

You can go through the steps with me. Get the (randomized) data from GitHub. It also contains the Jupyter notebook with all the steps.

First, let’s load the data. Since the dataset contains just one column, it’s quite simple.

# import the packages which will be used
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"path\data.csv")
df.info()

The .info() command tells us that we have 10902 passports in the dataset and that all were imported as “object”, which means the format is string.

An initial step of any analysis should be a check for duplicated values. In our case there are some, so we will remove them using pandas’ .drop_duplicates() method.

print(len(df["PassportNumber"].unique()))  # if less than 10902 there are duplicates
df.drop_duplicates(inplace=True)  # or df = df.drop_duplicates()
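
If you also want to report how many duplicates were dropped, a small before/after check works (a sketch, equivalent to the call above):

rows_before = len(df)
df = df.drop_duplicates()
print(f"removed {rows_before - len(df)} duplicated rows, {len(df)} remain")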

Usually, you would continue with a check of the minimum and the maximum passport number.

[In]: df["PassportNumber"].agg(["min","max"])
[Out]:
min 000000050
max ZXD244549
Name: PassportNumber, dtype: object

You might become convinced that all the passports are 9 characters long, but you would be misled. The data have string format, so the lowest “string” value is the one which starts with the highest number of zeros, and the largest is the one with the most Zs at the beginning.

# ordering of strings is not the same as ordering of numbers
0 < 0001 < 001 < 1 < 123 < AB < Z < Z123 < ZZ123
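
You can verify this ordering quickly in plain Python (a tiny sketch):

values = ["1", "123", "0001", "Z123", "AB", "0", "ZZ123", "001", "Z"]
print(sorted(values))
# ['0', '0001', '001', '1', '123', 'AB', 'Z', 'Z123', 'ZZ123']
# lexicographic comparison goes character by character, so leading zeros
# sort first and letters come after digits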

In order to see the real lengths, let’s check the length of each passport number.

[In]: df["PassportNumber"].apply(len).agg(["min","max"])
[Out]:
min 3
max 17
Name: PassportNumber, dtype: int64

In contrast to our initial belief, the shortest passport contains only 3 characters while the longest is 17 (far more than the expected maximum of 9) characters long.

Let’s extend our data frame with a 'len' column so that we can look at examples:

# Add the 'len' column
df['len'] = df["PassportNumber"].apply(len)
# look at the examples having the maximum length
[In]: df[df["len"]==df['len'].max()]
[Out]:
PassportNumber len
25109300000000000 17
27006100000000000 17
# look at the examples having the minimum length
[In]: df[df["len"]==df['len'].min()]
[Out]:
PassportNumber len
179 3
917 3
237 3

The 3-digit passport numbers look suspicious, but they meet the ICAO criteria. The longest ones are obviously too long; however, they contain quite a few trailing zeros. Maybe somebody simply added the zeros in order to meet some data storage requirements.

Let’s take a look at the overall length distribution of our data sample.

# calculate the count of occurrences of the various lengths
counts_by_value = df["len"].value_counts().reset_index()
separator = pd.Series(["|"]*df["len"].value_counts().shape[0])
separator.name = "|"
counts_by_index = df["len"].value_counts().sort_index().reset_index()
length_distribution_df = pd.concat([counts_by_value, separator, counts_by_index], axis=1)

# draw the chart
ax = df["len"].value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("length")
ax.set_ylabel("number of records")
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.05))
Distribution of the passport lengths in the data sample

We see that most passport numbers in our sample are 7, 8 or 9 characters long. Quite a few, however, are 10 or 12 characters long, which is surprising.

Maybe the long passports have leading or trailing zeros, like our example with 17 characters.

In order to explore these zero-pads, let’s add two more columns to our data set, 'leading_zeros' and 'trailing_zeros', to contain the number of leading and trailing zeros.

# the number of leading zeros can be calculated by subtracting the length
# of the string l-stripped of its leading zeros from the total length of the string
df["leading_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.lstrip("0")))
# similarly, the number of trailing zeros can be calculated by subtracting the length
# of the string r-stripped of its trailing zeros from the total length of the string
df["trailing_zeros"] = df["PassportNumber"].apply(lambda x: len(x) - len(x.rstrip("0")))
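
A quick worked example of the idea (a sketch with a made-up value):

p = "0012300"
print(len(p) - len(p.lstrip("0")))  # 2 leading zeros
print(len(p) - len(p.rstrip("0")))  # 2 trailing zeros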

Then we can simply display the passports which have more than 9 characters to check if they have any leading or trailing zeros:

[In]: df[df["len"]>9]
[Out]:
PassportNumber len leading_zeros trailing_zeros
73846290957 11 0 0
N614226700 10 0 2
WC76717593 10 0 0
...

Most of the passports in the set don’t have any zeros, and still they’re longer than 9 characters. Just for the sake of showing it, let’s take a look at the distribution of zeros at each position of the passport numbers.

We know that the shortest passport fills 3 positions and the longest 17. Let’s iterate through all the passport numbers and then iterate through all their characters to see where the zeros are.

passports_list = []
# for each passport number
for passport_number in df["PassportNumber"]:
    # let's create a dictionary with the passport number
    pos = {"PassportNumber": passport_number}
    # and for each position check whether it holds a zero -> True, or something else -> False
    for i, c in enumerate(passport_number):
        pos[i+1] = True if c == "0" else False
    passports_list.append(pos)
# let's turn the list of dictionaries into a pandas dataframe
zeros_distribution = pd.DataFrame(passports_list)
zeros_distribution["len"] = zeros_distribution["PassportNumber"].apply(len)

The output of this operation is a new data frame, 'zeros_distribution', which contains True at each position where there is a zero in the passport number.

Zeros are marked as True at the position where they appear

Notice that I’ve highlighted the True values with a yellow background. Styling in pandas can be done using the .style method, but you have to be careful. Styling will render the whole dataset, which can be time-consuming, so specify how many rows you want to display, e.g. by .sample(5). You can apply the style to some columns only using the subset parameter.

# styling function
def highlight_true(val):
    return 'background-color: yellow' if val == 1 else ''

# apply the style to each value using applymap
# subset for the position columns 1-17
zeros_distribution.sample(5).style.applymap(highlight_true, subset=list(range(1,18)))

By summing the zeros at each position we can see how many zeros appear there. By counting all values at a position we can see how many times this position is filled in in the dataset (e.g. position 17 is rarely filled in), because count(N/A) is 0. We run these aggregations only over columns 1–17, via range(1,18).

# sum the zeros at each position
zero_count = zeros_distribution[range(1,18)].sum()
zero_count.name = "is_zero"
# count how many times each position is filled in
position_filled_count = zeros_distribution[range(1,18)].count()
position_filled_count.name = "is_filled"
Distribution of zeros as a table, and charted for positions 10 or above
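
The table and chart can be reproduced along these lines (a minimal sketch, assuming the two Series computed above):

# combine both aggregations into one table
zeros_by_position = pd.concat([zero_count, position_filled_count], axis=1)
print(zeros_by_position)
# chart the zero counts for positions 10 and above
zeros_by_position.loc[10:, "is_zero"].plot(kind="bar")
plt.show()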

We see that except for positions 16 and 17, the values aren’t trailing zeros. It’s certainly conceivable that these values are combinations of both leading and trailing zeros, but that’s not very probable. We have also already seen examples of passport numbers longer than 9 characters before.

We can simply .strip("0") to remove both leading and trailing zeros, and we will see that there are still invalid passport numbers longer than 9 characters:

df["PassportNumber"].str.strip("0").apply(len).value_counts().sort_index().plot(kind="bar")
Even when we strip zeros from both ends, some passport numbers are longer than 9 characters

To review string patterns it’s helpful to use the power of regular expressions. Even though their syntax is a little cumbersome, once used correctly they can be very quick and efficient.

To start, let’s add a column which is True in case the passport number begins with a letter and False otherwise.

df["starts_with_letter"] = df["PassportNumber"].observe(lambda x: True if re.fit("^[a-zA-Z]+.*", x) else False)

The regular expression ^[a-zA-Z]+.* means that:

  • ^ at the beginning
  • [a-zA-Z] a lower case or capital letter
  • + one or more times
  • . followed by any character
  • * zero or more times
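
As a side note, the same flag can be computed without apply, using pandas’ vectorized string matching (a sketch; Series.str.match anchors at the start of the string on its own):

df["starts_with_letter"] = df["PassportNumber"].str.match("[a-zA-Z]")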

I like to display the results not just as a Series but as a DataFrame, because I usually want to show additional information, for example the count and the percentage, as in the case of passports starting with a letter.

start_with_letter_count = df["starts_with_letter"].value_counts()
pd.DataFrame({"count": start_with_letter_count,
              "percentage": start_with_letter_count/df.shape[0]}).style.format("{:.2%}", subset=["percentage"])
Count and percentage of passports which start with a letter

One might also be interested in whether there are letters only at the beginning, or whether they also appear in the middle of the passport number (after at least one number). This can again be easily solved with regular expressions:

  • ^ at the beginning
  • . any character (letter, number or another symbol)
  • * zero or more times
  • \d then a digit
  • + one or more times
  • [a-zA-Z] then a letter in lower or capital case
  • + one or more times

df["letter_after_number"] = df["PassportNumber"].apply(lambda x: "letter after number" if re.match(r"^.*\d+[a-zA-Z]+", x) else "trailing numbers only")

If we display the count and the percentage using the same code as above, we can see that most of the passports don’t have a letter in the middle.

Less than 10% of passport numbers in our dataset have a letter in the middle

Another question, when we were designing the passport pattern for our system, was how many letters can appear at the beginning of the string. In order to find out, let’s devise a simple function using a regex.

def length_of_start_letter_sequence(string):
    # re.match returns None when no match is found, and applying .group(0) would lead to an error
    if re.match("^[a-zA-Z]+", string) is not None:
        return len(re.match("^[a-zA-Z]+", string).group(0))
    else:
        return 0

^[a-zA-Z]+ means that there are one or more letters at the beginning. The .group(0) method of re.match returns the first group which matches this regular expression. So in the case of ABC123 it would return ABC and we can simply count its length. The only catch is that re.match returns None and group() fails when the pattern isn’t met, so we have to handle that for cases when the passport contains only numbers.
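
A quick illustration of both cases (a tiny sketch):

print(re.match("^[a-zA-Z]+", "ABC123").group(0))  # 'ABC'
print(re.match("^[a-zA-Z]+", "123456"))  # None, so calling .group(0) on it would raise an error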

Once we have our function, it’s just a matter of applying it to the PassportNumber column:

df["lenght_of_start_letters"] = df["PassportNumber"].observe(lenght_of_start_letter_sequence)

In this case, let’s not display the statistics, but let’s list the values based on the length of the letter sequence at the beginning:

df.sort_values(by="length_of_start_letters", ascending=False).loc[df["length_of_start_letters"]>0,["PassportNumber","length_of_start_letters"]]
The longest letter sequence at the beginning is 6 letters; some passports have 4

You are sometimes asked to provide example values, which is easily done with the .sample() method.

# 5 examples of the passports which start with 3 letters
[In]: df[df["length_of_start_letters"]==3]["PassportNumber"].sample(5).to_list()
[Out]: ['DZO795085', 'SNJ370118', 'UJR13307234', 'DSG229101', 'VAA353972']

There’s a chance that the data provider has added a prefix to the values which was not originally part of the passport number. For most purposes, such a prefix should be removed. But how to find it?

Let’s first assume that our prefix has 3 characters. Then we can simply slice the first 3 characters and use value_counts to see which are the most common:

[In]: df["PassportNumber"].str.slice(forestall=3).value_counts()
[Out]:
000 41
009 37
005 33

We can see that some of the values appear much more often than the average occurrence of a 3-character prefix (.mean() applied to the code above gives 2.3). 009 appears 37 times, which is much more often than 2.3 times.
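
That average is just one extra call on the same chain (a sketch):

# average number of occurrences per distinct 3-character prefix
df["PassportNumber"].str.slice(stop=3).value_counts().mean()  # roughly 2.3 in this sample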

You can extend your analysis and check whether the prefix appears only for certain lengths. We can assume that passports having 12 characters have at least a 3-character prefix. The following code will reveal that the prefix 932 is much more common for 12-character-long passports than usual.

c = df[["PassportNumber", "len"]]
c["prefix"] = c["PassportNumber"].str.slice(stop=3)
# group by both the prefix and the total length of the passport
prefix_by_length_df = c.groupby(["prefix", "len"]).size().reset_index()
prefix_by_length_df[prefix_by_length_df["len"]==12].sort_values(by=0, ascending=False)

The value 932 appears 27 times, while the average for 12-character-long passports is only 1.3. You can rerun these simple size-counts for shorter or longer prefixes, or restrict them to other characteristics.
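
A small helper makes it easy to rerun the check for any prefix length (a sketch; the helper name is my own):

def prefix_stats(series, prefix_len):
    # how often each prefix of the given length occurs, plus the average occurrence
    counts = series.str.slice(stop=prefix_len).value_counts()
    return counts.head(5), counts.mean()

top5, avg = prefix_stats(df["PassportNumber"], 2)
print(top5)
print(f"average occurrence of a 2-character prefix: {avg:.1f}")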

If you have sensitive data but you need to share it with a data scientist, most clients would agree to share an anonymized or randomized sample. In our case, let’s try a simple randomization which:

  • Keeps the leading and trailing zeros
  • Changes any number to a random number
  • Changes any letter to a random letter
import random
import string

def passport_randomizer(list_of_passports):
    processed = {}  # dictionary which will keep the {"old": "new value"} pairs
    # loop through all the values
    for p in list_of_passports:
        leading_zeros_up_to = len(p) - len(p.lstrip("0"))
        trailing_zeros_from = len(p) - (len(p) - len(p.rstrip("0")))
        out = []
        for i, c in enumerate(p):
            # keep the leading and trailing zeros intact
            if i < leading_zeros_up_to or i >= trailing_zeros_from:
                out.append(c)
            # then change any number to a random number
            elif c.isnumeric():
                out.append(str(random.randint(0,9)))
            # finally change the rest to random letters
            else:
                out.append(random.choice(string.ascii_letters).upper())
        processed[p] = "".join(out)
    return processed

Such a function goes character by character and changes the values to new random ones. The result is a dictionary {"old value": "new value"}:

{'0012300': '0050100',
'ABC': 'LNZ',
'00EFG': '00AQT',
'IJK00': 'KVP00',
'012DF340': '032DT030'}

This approach keeps the leading/trailing zeros for an analysis of the randomized data; however, it destroys any repeating prefix. A prefix-preserving approach would be more complex, and you can try that at home.

Having the mapping of the old to new values, we can simply .map() it onto the list of passports to obtain a new randomized list which keeps most of the characteristics of our original set.

df["PassportNumber"].map(passport_randomizer(df["PassportNumber"].distinctive())).to_csv(r"new_data.csv", index=False, header=True)

The function is applied to the .unique() passports to avoid the same duplicated passport being turned into two different new random values. It’s still possible that two different passports would become the same. To avoid that risk, you would have to check whether the new random value isn’t already used in the current mapping during the creation of each new string.
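
A minimal sketch of that collision check (assuming a hypothetical randomize_one helper that randomizes a single passport the same way as the loop body above):

def passport_randomizer_no_collisions(list_of_passports):
    processed = {}
    used = set()
    for p in list_of_passports:
        candidate = randomize_one(p)  # hypothetical single-passport randomizer
        while candidate in used:  # re-draw until the candidate is unused
            candidate = randomize_one(p)
        used.add(candidate)
        processed[p] = candidate
    return processed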

In this article, we have reviewed how to analyze a list of alphanumeric strings to reveal their patterns. The techniques we have used are among the most common requirements of any data analysis or data cleaning task, and they can be tweaked to cover a huge range of similar tasks like:

  • counting the length of the string
  • checking the leading and trailing zeros
  • reviewing the occurrence of a specific character (0) at a specific position
  • reviewing how many values start with a letter
  • reviewing how many have a letter in the middle
  • analyzing whether there is a common prefix
  • randomizing the dataset

Feel free to play with the data yourself:
