In this blog, I will use the Newsgroup text dataset as an example and introduce a one-stop solution that can be quickly implemented in your text cleaning pipeline. Feel free to add more features that fit your own domain.

Let's compare the text processing result with the original text:

Text before processing:
> newsgroups_train.data
From: (Keith Allan Schneider) Subject: Re: >The "`little' things" above were in reference to\nGermany, clearly.

Text after processing:
> text_process(newsgroups_train.data)
keith allan schneider pompous ass organization california institute technology pasadena line 16 nntp-posting-host jon livesey write little thing reference germany clearly people say similar thing germany one could name be true give two example one rather pevasive anti-semitism german christianity well hitler arrive system social rank use imperail germany austria distinguish jews rest population seem like little thing least order bad motto think motto little thing lead bad thing keith

We will show how to achieve this step by step.

Origin text:
> origin_text
From: (Keith Allan Schneider) Subject: Re: >The "`little' things" above were in reference to\nGermany, clearly. People\n>said that there were similar things\nin Germany, but no one could name any.\n>That's not true. One was the rather\n>pevasive anti\nsemitism in German Christianity well before Hitler\n>arrived. \nThe other was the system of social ranks that were used\n>in\nImperail Germany and Austria to distinguish Jews from the\nrest \n>of the population.\nThese don't seem like "little\nthings" to me. At least, they are orders\nworse than the\nmotto. Do you think that the motto is a "little thing"\nthat\nwill lead to worse things?\nkeith

Here are the steps we will take to process the text:

Remove only punctuation that separates sentences: here we should remove only the punctuation that separates sentences or clauses (/ , ? ! :), not punctuation that is part of a token. Punctuation such as the hyphen is not included because it is often used to join words or parts of words; many compound words lose their meaning if the hyphen is removed. We also can't replace the period, since it is used within email addresses.

> text_string = origin_text.replace("/", ' ').replace(',', ' ').replace('?', ' ').replace('!', ' ').replace(':', ' ').strip()
> text_string
From (Keith Allan Schneider)\nSubject Re >The "`little\' things" above were in reference to Germany clearly.
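The separator-removal step in this post (replacing `/`, `,`, `?`, `!` and `:` with spaces while leaving hyphens and periods alone) can also be written as a small reusable function. The following is only a minimal sketch: the names `SEPARATORS` and `remove_separators` are hypothetical and not from the original pipeline, and it uses `str.translate` rather than the chained `replace` calls, which should give the same result for these single-character replacements.

```python
# Punctuation treated as sentence/clause separators.
# The set is taken from the chained replace() calls in the post;
# the names below are illustrative assumptions, not the author's code.
SEPARATORS = "/,?!:"

# Translation table mapping each separator character to a single space.
_SEPARATOR_TABLE = str.maketrans({ch: " " for ch in SEPARATORS})


def remove_separators(text: str) -> str:
    """Replace separator punctuation with spaces and trim the ends.

    Hyphens and periods are deliberately left untouched, so compound
    words and email addresses stay intact.
    """
    return text.translate(_SEPARATOR_TABLE).strip()


if __name__ == "__main__":
    sample = "Subject: Re: anti-semitism, clearly! Mail user@example.com?"
    print(repr(remove_separators(sample)))
```

Each separator becomes a space rather than being deleted, so words on either side of a comma or slash do not get fused together; any runs of spaces this produces can be collapsed later when the text is tokenized.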