Reading a Text File for Machine Learning
Transforming Text Files to Data Tables with Python
A reusable approach to extract data from any text file
In this article, I describe how to transform a set of text files into a data table which can be used for natural language processing and machine learning. To showcase my approach, I use the raw BBC News Article dataset published by D. Greene and P. Cunningham in 2006.
Before jumping into the IDE and starting to code, I usually follow a process consisting of understanding the data, defining an output, and translating everything into code. I usually consider the tasks before coding the most important, since they help to structure and follow the coding process more efficiently.
1. Data Understanding
Before being able to extract any information from a text file, we want to know how its information is structured as well as how and where the text files are stored (e.g. name, directory).
Structure
To understand the structure, we take a look at some of the text files to get a sense of how the data is structured.
Claxton hunting first major medal
British hurdler Sarah Claxton is confident she can win her first major medal at next month's European Indoor Championships in Madrid.
The 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage.
...
In the context of news articles, it can easily be assumed that the first and second sections correspond to the title and the subtitle respectively. The following paragraphs represent the article's text. Looking at the sample data, we also recognise that the segments are separated by new lines, which can be used for splitting the text.
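To illustrate the splitting idea, here is a minimal sketch (the text is just a shortened version of the sample above; the actual reading logic follows further below):

    # Hypothetical, shortened article content with sections separated by new lines
    raw_text = "Claxton hunting first major medal\nBritish hurdler Sarah Claxton is confident ...\nThe 25-year-old has already smashed the British record ..."
    sections = [section for section in raw_text.split('\n') if section]  # split and drop empty strings
    title, subtitle, paragraphs = sections[0], sections[1], sections[2:]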
Storage
To write a script that automatically runs through every text file, we need to know how the text files are stored. Thus, we are interested in the naming and organisation of the directories. Potentially, we need to restructure things so we can loop through the files more easily.
Luckily for us, the BBC news dataset is already well structured for automating the information extraction. As can be seen in the dataset's directory structure, the text files are stored in directories according to their genre. The file names follow the same pattern for every genre and are made up of leading zeros (if the file number is below 100), the file number, and ".txt".
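For illustration, a minimal sketch of how such a file name and path can be built in Python (the base directory 'data' and the genre 'sport' are just examples; the same zero-padded format string is used in the extraction code further below):

    import os

    # Build the zero-padded file name for article number 7 -> '007.txt'
    file_name = "{:03d}.txt".format(7)
    # Combine it with the base directory and the genre -> 'data/sport/007.txt'
    file_path = os.path.join('data', 'sport', file_name)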
2. Output Definition
Based on the insights of the data understanding step, we can define what information should be included in the output. In order to determine the output, we have to consider the learnings of the previous step as well as think about potential use cases for the output.
Based on the information we can potentially extract from the text files, I come up with two different use cases for machine learning training:
- Text classification (genre prediction based on the text)
- Text generation (title or subtitle generation based on the text)
In order to fulfil the requirements for both potential use cases, I would propose extracting the following information:
- Genre (taken from the directory name)
- Title (the first text section)
- Subtitle (the second text section)
- Text (the remaining sections, joined together)
I would also include the length of the text (in number of tokens) to make it easier to filter for shorter or longer texts afterwards. To store the extracted data, I would suggest a tab-separated values (.tsv) file, since commas or semicolons could be present in the text column.
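As a sketch of the intended output format, the header line of such a .tsv file would contain the columns used later in the code (pandas additionally writes an index column by default, which is omitted here):

    genre	title	subtitle	text	token_counts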
3. Coding
Thanks to the previous steps, we know the data we are dealing with and what kind of information we want to output at the end of the transformation process. As you might know by now, I like to break tasks into smaller parts. The coding step is no exception to this :) Generally, I would split the coding into at least three different parts and wrap them in individual functions:
- Reading and splitting a file
- Extracting the data
- Building the data frame
In order to make this news article extractor reusable, I create a new class that implements these functions.
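As a rough sketch (not the full implementation, which follows method by method below), the class could be laid out like this:

    class ArticleCSVParser:
        """Transforms the BBC news text files into a single data frame / csv file."""

        def read_and_split_file(self, genre: str, file_name: str) -> list:
            ...  # read one file and split it into its text sections

        def extract_genre_files(self, genre: str) -> pd.DataFrame:
            ...  # loop over all files of one genre and collect the extracted data

        def transform_texts_to_df(self, name, genre_list, delimiter='\t'):
            ...  # combine the data of all genres and save the result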
Reading and splitting a file
In order to read a file with Python, we need the corresponding path, consisting of the directory and the filename. As we observed in the Data Understanding step, the files are stored in their corresponding genre's directory. This means that to access a file, we need the base path ('data' for me), its genre and its name.
If the file exists, we want to read it, split it by the new line characters ('\n'), filter out empty strings and return the remaining text sections as a list. In the case that the file doesn't exist (e.g. the file number is larger than the number of available files), we want to return an empty list. I prefer this over working with exceptions or returning None if the file doesn't exist.
def read_and_split_file(self, genre: str, file_name: str) -> list:
    text_data = list()
    current_file = os.path.abspath(os.path.join('data', genre, file_name))
    if os.path.exists(current_file):
        open_file = open(current_file, 'r', encoding="latin-1")
        text_data = open_file.read().split('\n')
        text_data = list(filter(None, text_data))
    return text_data

As you can see, the code above uses the os package. Thus, we need to import this package.
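The corresponding import at the top of the script:

    import os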
Extracting the data
In order to extract the information from the text files and prepare it for the next step, I would suggest doing this per genre. This means that we loop over every file in the corresponding genre's directory. By keeping a current_number variable, we can format the filename with the leading zeros and then read and split the file by calling the above-implemented method.
If the returned list is empty, we want to stop the loop, since this means that we have reached the end and that there aren't any new files left in the directory.
Otherwise, we add the returned data of the reading and splitting function to specific data containers, such as titles, subtitles, and texts. Since I suggested also providing the token count of the text in the final output, we can use the nltk package to tokenize the text and add the length of the list of tokens to our token_counts list. Finally, we increment the current_number by 1 to continue the extraction process with the next file.
def extract_genre_files(self, genre: str) -> pd.DataFrame:
    found = True
    current_number = 1
    titles = list()
    subtitles = list()
    texts = list()
    token_counts = list()
    while found:
        file_name = "{:03d}.txt".format(current_number)
        text_data = self.read_and_split_file(genre, file_name)
        if len(text_data) != 0:
            titles.append(text_data[0])
            subtitles.append(text_data[1])
            article_text = ' '.join(text_data[2:])
            texts.append(article_text)
            token_counts.append(len(nltk.word_tokenize(article_text)))
            current_number += 1
        else:
            found = False
    genres = [genre] * len(titles)
    data = {'genre': genres, 'title': titles, 'subtitle': subtitles, 'text': texts, 'token_counts': token_counts}
    data_frame = pd.DataFrame(data)
    return data_frame
After finishing the loop through the genre files, we create a data frame based on the extracted information that was stored inside the specific lists. Similar to the previous step, we need to import two packages (nltk and pandas). Please also make sure that you have downloaded the 'punkt' data of the nltk package, since it is required to tokenize texts.
import nltk
# nltk.download('punkt')
import pandas as pd

Building the data frame
In a final step, we have to loop over the existing genres, extract the data per genre by calling the above-implemented method, concatenate the output for every genre and finally save the concatenated data frame as a csv with the desired separator.
def transform_texts_to_df(self, name, genre_list, delimiter='\t'):
    article_df_list = list()
    for genre in genre_list:
        article_df_list.append(self.extract_genre_files(genre))
    df = pd.concat(article_df_list)
    df.to_csv(name, sep=delimiter)
    return df

After implementing the class and its methods, we need to create an instance of the ArticleCSVParser class and call the transform_texts_to_df method, providing the desired name for the resulting csv and a list containing every genre. Et voilà.
if __name__ == "__main__":
    genre_list = ['business', 'entertainment', 'politics', 'sport', 'tech']
    parser = ArticleCSVParser()
    df = parser.transform_texts_to_df('bbc_articles.csv', genre_list)
    print(df.head())

Conclusion
In this article, I showed how to transform text files into a data frame and save it as a csv/tsv. In order to reuse the class for a different dataset, just create a new class that inherits from ArticleCSVParser and override the methods that have to be changed.
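As a hedged sketch of that reuse idea (the subclass name and the alternative splitting logic are made up for illustration, assuming a dataset that separates its sections with blank lines instead of single new lines):

    class MyCustomParser(ArticleCSVParser):
        # Hypothetical subclass: only the reading/splitting step is overridden,
        # everything else is inherited from ArticleCSVParser.
        def read_and_split_file(self, genre: str, file_name: str) -> list:
            text_data = list()
            current_file = os.path.abspath(os.path.join('data', genre, file_name))
            if os.path.exists(current_file):
                with open(current_file, 'r', encoding='utf-8') as open_file:
                    text_data = [s for s in open_file.read().split('\n\n') if s]
            return text_data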
You can also find the complete code and dataset in this repository.
I hope you enjoyed it and happy coding!
Source: https://towardsdatascience.com/transforming-text-files-to-data-tables-with-python-553def411855