Mindtwist.de

...let your mind twist!

How to create a keyword summary of a text with Python

If you are a serious filesystem janitor, you want to organize your files so that you can easily retrieve them at any time. To get better search results, you may want to include keywords in your documents' metadata (for example, in the PDF files you create). You may also want to screen a document from the command line to get a quick overview. The Python script in this article helps with both: it scans a document or the standard input and prints the 25 most used significant words.
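
The core idea, counting words and keeping the most frequent ones that are not stop words, can be sketched in a few lines. This is only an illustration and not the script from this article: it is written for Python 3, and the names top_words and STOPWORDS as well as the tiny stop-word set are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word set, for illustration only.
STOPWORDS = {"this", "that", "with", "from"}

def top_words(text, n=25):
    """Return the n most frequent words of 4 to 14 letters that are not stop words."""
    # keep a-z only; every other character becomes a space
    cleaned = "".join(ch if "a" <= ch <= "z" else " " for ch in text.lower())
    words = [w for w in cleaned.split() if 3 < len(w) < 15 and w not in STOPWORDS]
    # most_common() orders the words by descending frequency
    return [word for word, _ in Counter(words).most_common(n)]

print(" ".join(top_words("Kernel boot kernel device kernel boot", 2)))
```

Unlike the full script, which prints its result in alphabetical order, this sketch returns the words ordered by frequency.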

Another use case for this Python script is creating rough keyword lists for the SEO of your websites.
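
For the SEO use case, the page text has to be separated from the markup before it is analyzed, and the resulting words have to be put into a meta tag. The following Python 3 sketch shows one possible way to do both with the standard library. The helper names TextExtractor, html_to_text and keywords_meta_tag are invented for this example and are not part of the script.

```python
from html import escape
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text of an HTML page, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def keywords_meta_tag(keywords):
    """Turn a space-separated keyword string into a meta keywords tag."""
    content = ", ".join(keywords.split())
    return '<meta name="keywords" content="%s">' % escape(content, quote=True)

print(keywords_meta_tag("apache httpd webserver"))
```

The text returned by html_to_text could be piped into the script via standard input, and the script's space-separated output could then be passed to keywords_meta_tag.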

The Script

http://www.mindtwist.de/main/downloads.html?task=finish&cid=5&catid=3

#!/usr/bin/python
# This script scans a text file and creates tags of the most used words.
# The script has been released under BSD license.
# Copyright (C) 2010 Reiner Rottmann <rottmannATrottmann.it>

import string

def sort_by_value(d):
    u""" Returns the keys of dictionary d sorted by their values """
    items = d.items()
    # build [value, key] pairs so that sorting orders by value first
    backitems = [[v[1], v[0]] for v in items]
    backitems.sort()
    return [backitems[i][1] for i in range(0, len(backitems))]

def get_most_used_words(text, n):
    """ Returns the top n used words in a text without counting the top 100 most
    used words in English and German language"""

    # only print lower case words
    text = text.lower()

    # replace separators with spaces
    separators = "\n\r\f\t\v.,/\\""''"
    for separator in separators:
        text = text.replace(separator, " ")

    # remove invalid chars
    validchars = "abcdefghijklmnopqrstuvwxyz "

    charlist = list(text.lower())
    charlist = [char for char in charlist if char in validchars]
    text = "".join(charlist)

    # split up the words
    words = text.split(" ")

    # remove empty list words
    words = [word for word in words if word != ""]

    # sort words
    words.sort()
    # define some blacklists for German, English language etc.
    blacklistde = """aber all als also an andere auch auf aus bei beispiel
    bis da damit dann das dass denn der die dies doch du durch eigentlich ein er erste
    es fuer ganz geben gehen gross gut haben hier ich ihm ihn ihr immer in ja jahr jede
    jetzt kein koennen kommen lang lassen machen Mal man mehr mein mich mir mit muessen
    nach neu nicht noch nur oben oder sagen schon sehen sehr sein sein selbst sich sie
    so sollen stehen ueber um und uns unser unter viel von vor was weil wenn werden wie
    wieder wir wissen wo wollen zeit zu zwei"""

    blacklisten = """a about after again against all also and another any around as
    ask at back because become before begin between both but by call can change child
    come consider could course day develop do down during each early end even eye face
    fact feel few find first follow for form from general get give go good govern great
    group hand have he head help here high hold home house how however I if in increase
    interest into it just keep know large last late lead leave life like line little
    long look make man many may mean might more most move much must nation need never
    new no not now number of off old on one only open or order other out over own part
    people person place plan play point possible present problem program public real
    right run same say school see seem set she should show since small so some stand
    state still such system take tell than that the then there these they thing think
    this those through time to too turn under up use very want way we well what when
    where which while who will with without word work world would write year you"""

    blacklistbash = """{ } && || $ alias break case continue do done elif else esac
    exit export fi for if in return set then unalias unset while /C2 halt ifconfig
    init initlog insmod linuxconf lsmod modprobe reboot rmmod route shutdown traceroute
    /C3 ] [ awk basename cat cp echo egrep fgrep gawk grep gzip kill killall less md
    mkdir mv nice pidof ps rd read rm rmdir sed sleep test touch ulimit uname usleep
    zcat zless"""

    blacklistcode = """aasmlang abbr above absolute abstract accept acceptcharset
    access accesskey acos acronym action activate additem address alert align alink
    alpha alter anchor angle append applet application apply archive area areab
    arguments array ascending ascii asin assembler assert assign atan attach attr
    attribute attributes author authorization auto average axisbackground azaz
    azazazaz azazfunction azazindent back basefont beep before begin behavior bell
    below between bgcolor bgsound binary bitand bitmap bitnot bitor bitset blank
    blink block blockquote blue body bold bool boolean border bottom brace browse
    button buttoncaption byte call cancel caption cast catch ccdefault ccic ccie
    ccio ccir cdbl cdecl ceil ceiling cell cellspacing center chain change channel
    char character characters charat charoff chars charset chdir chdrive check
    checkbox checked checksum children chmod choice choose cint circle cite class
    classes classid clear click clip clipboard clng clock close closedir cluster
    code codebase codepage codetype colgroup collapse color colors cols colspan
    column columns command commands comment comments commit common compact compare
    compile compiler component compress compute concat condition confirm connect
    connected connection const constant constraint constructor container contains
    content contents context control convert coordsdata copy corr cosh count country
    create createobject cross csng curdir current currentdate currenttime cursor
    cursordatabase cycle database date dateadd datediff dateformat datepart datetime
    datevalue deallocate debug decimal declare decode default defer define defined
    definition delay delete deletefile delimiter delimiters depth desc descending
    describe description dest detach device dialog dictionary diff difference
    digits dimension direction directory disable disabled disconnect disk display
    distance distinct divide document doelse domain double doubleelse drag draw
    drop dtem dump edit editor eject element ellipse elseif elseunindent elsif
    embedfieldset empty enable enabled encoding encrypt endcase enddo endfor
    endfunction endif endloop endm endselect endswitch endwhile enter entity
    entry enum environ environment equal erase error eruser escape eval evaluate
    even event every exact except exception exchange exclusive exec execute
    exist exists expand extendsfalse extensions extern external false fclose
    feof ferror fetch fflush fgets field fields fieldset file fileclose filecopy
    filedelete fileexists filename fileopen files filesize filetype fill filter
    final finally find findnext finish fixed flag fldata float flock floor
    flush focus fold font fontcolor fontsize fontstyle fopen force foreach
    foreign form format formula forward fputs frac frame frameborder frameset
    frameseth fread free freefile freq fseek ftell function functionget
    functions functionsabs getattr getdate getday getenv getfile getfullyear
    gethours getminutes getmonth getobject gets getseconds getsize gettext gettime
    getutcdate getutcminutes getutcmonth glob global gosub goto gotoif grant
    graphics gray grey grid group hash head header height help hidden hide high
    history home host hour href hreflang hspace html htmli htmllang httpequivid
    icdlevel icon identity ifdef ifndef iframe ignore ilayer image immediate
    implements import incdelimiters include increment indent index indexes indexof
    info initial initialize inkey inline inner input inputbox insert instance instr
    integer interface internal interrupt intersect interval invert ipaddress isalpha
    isarray isdate isdigit isempty isfinite isindex isindexkbd islower ismaplabel
    isnan isnull isnumber isnumeric isspace isupper item join kbdlabel keyf keys
    keywords label language large layer layout layoutmanager lcase lcdd leave
    left leftb leftmargin legend length level library like line lineno lines link
    list listbox listen listing listingmap load local locale localtime locate
    location lock loge long longdescmailto lookup loop lower lowercase ltrim
    macro margin marginheight marginwidth marker marquee mask master match matd
    math matrix maximum maxlength mean media median member memory menu menubar
    menuitem merge message messagebox meta method minimum minus minute miscformat
    mode model modify module month mouse mousedown mousemove move moveto multicol
    multicolnextid multiple multiplename names near newline newpage next nlssort
    nobr nocase node noframes nohref nolayer nolist none noquote noresize norm normal
    noscript noscriptobject noshadeobject note notor nowait null number numeric
    numpad object offset omega onblur onchange onclick ondblclick onerror onfocus
    onkeydown onkeypress onkeyup online onload onmousedown onmousemove onmouseout
    onmouseover onmouseup onreset onselect onsubmit onunload open opendir operators
    optgroup optgroupp optimize option optional options order otherwise outer
    outfile output page pagesize pagewidth palette panel paragraph parallel param
    parameter parameters parent parse parseint part pascal password paste path
    pathname pattern pause pecc peek perform pfcontrol pfpf picture pipe pitch
    platform play plot plus pocon point poke poly polygon popen popup popupmenu
    position post power pragma precision prefix preq prev previous primary print
    printer printf prior priority private privileges proc procedure process product
    profile program prompt promptreadonly properties property protected public
    push query quit quote quoterange rand random randomize range rate readdir
    readfile readline readonly real receive record rect rectangle redim reference
    references refresh regexp region register relative release reload remote remove
    rename repeat replace replicate report request reset resize resource response
    restart restore restrict result resume retry returns reverse revoke rewind right
    rightmargin role rollback root roots rotate round rowcount rows rowspan rset
    rtrim rulesscheme samp save scale scan schema scope screen script scroll
    scrollbar scrolling search second sect section security seek segment select
    selected selection self send sendkeys sendmessage separate separator sequence
    server servername session setattr setdate setfocus sethours setlocale setsize
    settime setutchours setyear sfxa sfxb sfxc sfxd sfxe sfxf shape share shared
    shell shift short show sign signal signed single sinh size sizeof skip slice
    slider small smallint snapshot socket sort sound soundex source space spacer
    spaces spacing span splice spline split sprintf spusercounter sqrt srand srcp
    stack stage standard standby start stat state statement static statistics status
    stddev step stop storage store strcat strcmp stream strike string strings
    strip strlen strong strstr struct structure stuff style stylesheet subc subject
    subset substitute substr substring subtract subtype summary super suptable
    svsv swap switch symbol symbols table tables tanh target task tazaz tazazfunction
    tbody tcdp tcecp tcon tcscp temp template term terminate text textarea textcolor
    textfield textheight textsize textwidth tfoot tfunction thead thenunindent thread
    throw timeout timer times timestamp timevalue timezone tindent title today tolower
    tolowercase topmargin tostring total toupper touppercase trans transaction transform
    translate trap trigger trim true trunc truncate type typeurl ucase ulvar umask
    undef underline unindent union unique unlink unload unlock unpack unshift until
    update upper uppercase usemapvalign user userid userkey username using valid
    validate value values valuetype varchar variable variables varwbrxmp vector verify
    version view visibility visible vlink wait warning wend width window word work
    wrap write"""

    blacklistuser = "does setup site talk usage uses versions working written your"

    # be picky what to include in the word list
    counts = {}
    for word in words:
        # accept only words that are 4 to 14 characters long
        if len(word) > 3 and len(word) < 15:
            # accept only words that are not in blacklists of the top 100 common
            # words and in various user blacklists; the blacklists are strings,
            # so "in" is a substring test here
            if (word not in blacklistde and word not in blacklisten
                    and word not in blacklistuser and word not in blacklistbash
                    and word not in blacklistcode):
                counts[word] = counts.get(word, 0) + 1

    # use only the top n words
    top = sort_by_value(counts)[-n:]

    # don't use the number of occurrences as sort criteria anymore
    top.sort()

    return " ".join(top)

def usage():
    """ This function prints the usage information"""
    import sys
    print("""This script has been released under BSD license. Copyright (C) 2010 Reiner Rottmann <rottmannATrottmann.it>

keywords.py analyzes a text file and prints a list of the top 25 used words.

Usage: ./keywords.py [textfile] [-]

Examples:
$ ./keywords.py kernel-parameters.txt
boot default device devices disable driver...

$ echo "Lorem ipsum..." | ./keywords.py -
assentior lorem molestie mollis mundi... """)
    sys.exit(1)

def main():
    """ This is the main function"""
    import sys
    # check whether no commandline argument has been entered and show usage
    if len(sys.argv) == 1:
        usage()
    else:
        # read from stdin if "-" is present as first commandline argument
        if sys.argv[1] == "-":
            text = sys.stdin.read()
        else:
            # open file in any other case; should there be errors, print error
            # message and usage
            try:
                filename = sys.argv[1]
                fd = open(filename, 'r')
                text = fd.read()
                fd.close()
            except IOError:
                print("ERROR: Input file could not be opened.")
                usage()
        # get top 25 used words
        print(get_most_used_words(text, 25))

# run main function if called directly
if __name__ == "__main__":
    main()
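
A note on the blacklist checks in get_most_used_words: the blacklists are plain strings, so an expression like "word not in blacklistde" tests for a substring, not for a whole word. As a result, a word such as "vice" is rejected because "device" is on a blacklist. If whole-word matching is preferred, the blacklists can be turned into sets first. The following Python 3 sketch shows the difference; the shortened blacklist and the helper name is_blacklisted are made up for the example, and switching the script to sets would change its output.

```python
# shortened blacklist, for illustration only
blacklist_string = "device every however"
blacklist_words = set(blacklist_string.split())

# substring test: "vice" is found inside "device"
print("vice" in blacklist_string)   # True
# whole-word test: "vice" itself is not listed
print("vice" in blacklist_words)    # False

def is_blacklisted(word, *blacklists):
    """Return True if word is a member of any of the given sets."""
    return any(word in blacklist for blacklist in blacklists)

print(is_blacklisted("device", blacklist_words))  # True
```

Set lookups also avoid scanning the long blacklist strings for every word of the input.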


The Script in Action

$ ./keywords.py 
This script has been released under BSD license. Copyright (C) 2010 Reiner Rottmann <rottmannATrottmann.it>

keywords.py analyzes a text file and prints a list of the top 25 used words.

Usage: ./keywords.py [textfile] [-]

Examples:
$ ./keywords.py kernel-parameters.txt
boot default device devices disable driver...

$ echo "Lorem ipsum..." | ./keywords.py -
assentior lorem molestie mollis mundi...

$ ./keywords.py /etc/apache2/httpd.conf
addtype allow allows apache configuration deny directive directives documents
 errorlog example extra httpd ifmodule includes information libexec loadmodule
 logfile logged serverroot specific support virtualhost webserver

$ ./keywords.py /etc/sshd_config
account authentication banner configuration hostkey listenaddress passwords
 processing protocol sacl saclsupport sbin setting sshd sshdconfig them trust
tunneled uncommented usedns uselogin usepam xdisplayoffset xforwarding
xuselocalhost
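
The sshd_config result contains words such as sshdconfig and xforwarding. Underscores and digits are not in the script's separator list, so the valid-character filter deletes them and glues the neighbouring parts together; xforwarding presumably stems from the X11Forwarding option. The Python 3 sketch below contrasts that behaviour with a regular-expression tokenizer that treats every non-letter as a separator. The function names script_style_words and tokenize are assumptions for this example, and the second variant is an alternative, not what the script does.

```python
import re

VALIDCHARS = "abcdefghijklmnopqrstuvwxyz "

def script_style_words(text):
    """Mimic the script's filter: drop every character that is not a-z or a space."""
    return "".join(ch for ch in text.lower() if ch in VALIDCHARS).split()

def tokenize(text):
    """Alternative: every run of non-letters acts as a separator."""
    return re.findall(r"[a-z]+", text.lower())

sample = "X11Forwarding in sshd_config"
print(script_style_words(sample))  # ['xforwarding', 'in', 'sshdconfig']
print(tokenize(sample))            # ['x', 'forwarding', 'in', 'sshd', 'config']
```

Short fragments such as "x" would still be dropped afterwards by the script's word-length check.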

 
