cutting rows and sorting them in order
I have a list of 13 gene identifiersi.e. (>ENSMUSG000000
Question information
- Language:
- English Edit question
- Status:
- Solved
- Assignee:
- No assignee Edit question
- Solved by:
- avkapur
- Solved:
- Last query:
- Last reply:
Revision history for this message
![]() |
#1 |
I have a hard time imagining how your text looks like. Can you post a snippet? It would help.
Revision history for this message
![]() |
#2 |
Here you go, hope it helps, thanks.
>ENSMUSG00000000159
Hoxb1(3) 4 - 13 - tcatttctct 3.36
Hoxb1(2) 4 - 13 - tcatttctct 3.19
Hoxb1(3) 76 - 85 - acctctatcc 3.51
Hoxb1(2) 76 - 85 - acctctatcc 3.34
Hoxb1(2) 80 - 89 - ctatccaggc 2.73
Hoxb1(3) 80 - 89 - ctatccaggc 2.91
Hoxb1(2) 84 - 93 - ccaggctcct 2.52
Hoxb1(3) 84 - 93 - ccaggctcct 2.69
Hoxb1(2) 91 - 100 - ccttcctcct 4.84
Hoxb1(3) 91 - 100 - ccttcctcct 4.75
Hoxb1(1) 91 - 100 - ccttcctcct 4.22
Hoxb1(3) 111 - 120 + agacgcctgc 2.59
Hoxb1(2) 111 - 120 + agacgcctgc 2.42
Hoxb1(2) 133 - 142 - caaacgctct 3.24
Hoxb1(3) 133 - 142 - caaacgctct 3.41
Hoxb1(3) 162 - 171 + agacaaaagg 3.43
Hoxb1(2) 162 - 171 + agacaaaagg 3.26
Hoxb1(2) 166 - 175 + aaaaggatct 3.38
Hoxb1(3) 166 - 175 + aaaaggatct 3.55
Hoxb1(3) 172 - 181 - atctctatct 3.55
Hoxb1(2) 172 - 181 - atctctatct 3.38
Hoxb1(3) 176 - 185 - ctatctagtt 3.26
Hoxb1(2) 176 - 185 - ctatctagtt 3.09
Hoxb1(3) 614 - 623 - tcatgctttt 3.07
Hoxb1(2) 614 - 623 - tcatgctttt 2.9
Hoxb1(3) 619 - 628 - cttttcctct 3.14
Hoxb1(2) 619 - 628 - cttttcctct 2.97
Hoxb1(2) 670 - 679 - ccagctcact 3.02
Hoxb1(3) 670 - 679 - ccagctcact 3.19
Hoxb1(3) 675 - 684 - tcacttatct 3.55
Hoxb1(2) 675 - 684 - tcacttatct 3.38
Hoxb1(2) 681 - 690 + atctcgaagg 2.66
Hoxb1(3) 681 - 690 + atctcgaagg 2.83
Hoxb1(2) 685 - 694 + cgaaggctgt 2.16
Hoxb1(3) 685 - 694 + cgaaggctgt 2.33
Hoxb1(3) 711 - 720 + agagggaacc 2.71
Hoxb1(2) 711 - 720 + agagggaacc 2.54
Hoxb1(2) 748 - 757 + agaggcaggt 2.06
Hoxb1(3) 748 - 757 + agaggcaggt 2.23
Hoxb1(1) 754 - 763 + aggtggatgg 8.4
Hoxb1(3) 754 - 763 + aggtggatgg 6.69
Hoxb1(2) 754 - 763 + aggtggatgg 7.32
Hoxb1(3) 758 - 767 + ggatggccag 2.02
Hoxb1(3) 831 - 840 + agggcgattg 2.04
Hoxb1(2) 848 - 857 + atatgggcag 2.37
Hoxb1(3) 848 - 857 + atatgggcag 2.54
Hoxb1(2) 879 - 888 + aaaaggctcg 2.85
Hoxb1(3) 879 - 888 + aaaaggctcg 3.02
Hoxb1(2) 967 - 976 + agaaggaaag 4.88
Hoxb1(3) 967 - 976 + agaaggaaag 4.79
Hoxb1(1) 967 - 976 + agaaggaaag 4.26
>ENSMUSG00000023800
Hoxb1(3) 72 - 81 - ctattcatca 4.01
Hoxb1(2) 72 - 81 - ctattcatca 4.11
Hoxb1(1) 72 - 81 - ctattcatca 3.49
Hoxb1(2) 340 - 349 - acatctttca 2.19
Hoxb1(3) 340 - 349 - acatctttca 2.36
Hoxb1(3) 343 - 352 - tctttcatca 2.41
Hoxb1(2) 343 - 352 - tctttcatca 2.24
Hoxb1(3) 347 - 356 - tcatcaatgg 2.53
Hoxb1(2) 347 - 356 - tcatcaatgg 2.36
Hoxb1(2) 347 - 356 + tcatcaatgg 2.36
Hoxb1(3) 347 - 356 + tcatcaatgg 2.53
Hoxb1(3) 378 - 387 - ctctgcctct 3.36
Hoxb1(2) 378 - 387 - ctctgcctct 3.19
Hoxb1(2) 387 - 396 + tgttgtatag 2.03
Hoxb1(3) 387 - 396 + tgttgtatag 2.2
Hoxb1(2) 433 - 442 + aaatggctcc 2.67
Hoxb1(3) 433 - 442 + aaatggctcc 2.84
Hoxb1(3) 493 - 502 + aggggggtgg 5.09
Hoxb1(2) 493 - 502 + aggggggtgg 5.18
Hoxb1(1) 493 - 502 + aggggggtgg 4.56
Hoxb1(3) 494 - 503 + ggggggtggg 2.07
Hoxb1(3) 528 - 537 + agaatgaggg 4.23
Hoxb1(2) 528 - 537 + agaatgaggg 4.33
Hoxb1(1) 528 - 537 + agaatgaggg 3.71
Hoxb1(2) 554 - 563 - caaaccatgg 2.62
Hoxb1(3) 554 - 563 - caaaccatgg 2.79
Hoxb1(2) 606 - 615 - cattccatct 6.38
Hoxb1(3) 606 - 615 - cattccatct 6.01
Hoxb1(1) 606 - 615 - cattccatct 6.61
Hoxb1(3) 610 - 619 - ccatctgtgg 3.17
Hoxb1(2) 610 - 619 - ccatctgtgg 3
Hoxb1(1) 640 - 649 + acaaggatga 3.49
Hoxb1(3) 640 - 649 + acaaggatga 4.02
Hoxb1(2) 640 - 649 + acaaggatga 4.11
Hoxb1(3) 643 - 652 + aggatgatag 2.34
Hoxb1(2) 643 - 652 + aggatgatag 2.17
Hoxb1(3) 644 - 653 + ggatgatagg 2.63
Hoxb1(2) 644 - 653 + ggatgatagg 2.46
Hoxb1(3) 647 - 656 + tgataggtgt 2.44
Hoxb1(2) 647 - 656 + tgataggtgt 2.27
Hoxb1(3) 696 - 705 - gcagccctgg 2.07
Hoxb1(2) 737 - 746 - ccagactcct 3.09
Hoxb1(3) 737 - 746 - ccagactcct 3.26
Hoxb1(3) 745 - 754 - ctatggaact 2.53
Hoxb1(2) 745 - 754 - ctatggaact 2.36
Hoxb1(2) 752 - 761 + acttcggtgg 2.93
Hoxb1(3) 752 - 761 + acttcggtgg 3.1
Hoxb1(2) 765 - 774 + tggtggattg 4.47
Hoxb1(1) 765 - 774 + tggtggattg 3.85
Hoxb1(3) 765 - 774 + tggtggattg 4.37
Hoxb1(1) 769 - 778 + ggattgctgg 4.15
Hoxb1(3) 769 - 778 + ggattgctgg 4.68
Hoxb1(2) 769 - 778 + ggattgctgg 4.78
Hoxb1(3) 801 - 810 - gcagccccgt 2.13
Hoxb1(2) 807 - 816 - ccgtgtctct 3.12
Hoxb1(3) 807 - 816 - ccgtgtctct 3.29
Hoxb1(3) 823 - 832 + agcatggtgg 2.98
Hoxb1(2) 823 - 832 + agcatggtgg 2.81
Hoxb1(3) 827 - 836 + tggtggcagg 3.27
Hoxb1(2) 827 - 836 + tggtggcagg 3.1
Hoxb1(2) 855 - 864 - ccaggcagcc 3.57
Hoxb1(3) 855 - 864 - ccaggcagcc 3.74
Hoxb1(1) 855 - 864 - ccaggcagcc 2.09
Hoxb1(3) 903 - 912 - cgttccatca 4.6
Hoxb1(2) 903 - 912 - cgttccatca 4.7
Hoxb1(1) 903 - 912 - cgttccatca 4.08
Hoxb1(3) 907 - 916 + ccatcaatgg 2.84
Hoxb1(2) 907 - 916 - ccatcaatgg 4.58
Hoxb1(2) 907 - 916 + ccatcaatgg 2.67
Hoxb1(1) 907 - 916 - ccatcaatgg 3.96
Hoxb1(3) 907 - 916 - ccatcaatgg 4.49
Hoxb1(2) 961 - 970 - atatccagca 2.38
Hoxb1(3) 961 - 970 - atatccagca 2.55
Hoxb1(2) 978 - 987 - ggatccgttt 2.69
Hoxb1(3) 978 - 987 - ggatccgttt 2.86
>ENSMUSG00000024087
Hoxb1(3) 72 - 81 + agggagatga 2.6
Hoxb1(2) 72 - 81 + agggagatga 2.43
Hoxb1(3) 76 - 85 + agatgactgg 6.04
Hoxb1(2) 76 - 85 + agatgactgg 6.4
Hoxb1(1) 76 - 85 + agatgactgg 6.63
Hoxb1(3) 159 - 168 - caacccaacc 2.63
Hoxb1(2) 159 - 168 - caacccaacc 2.46
Hoxb1(3) 163 - 172 - ccaaccgttt 4.69
Hoxb1(1) 163 - 172 - ccaaccgttt 4.16
Hoxb1(2) 163 - 172 - ccaaccgttt 4.78
Hoxb1(3) 168 - 177 + cgtttgttgg 2.93
Hoxb1(2) 168 - 177 + cgtttgttgg 2.76
Hoxb1(3) 189 - 198 - gcctcctcct 2.44
Hoxb1(2) 189 - 198 - gcctcctcct 2.27
Hoxb1(3) 195 - 204 - tcctcccact 2.82
Hoxb1(2) 195 - 204 - tcctcccact 2.65
Hoxb1(3) 200 - 209 - ccactcaccg 2.32
Hoxb1(2) 200 - 209 - ccactcaccg 2.14
Hoxb1(3) 208 - 217 - cgccccctct 2.08
Hoxb1(2) 301 - 310 + aggtggcaga 2.13
Hoxb1(3) 301 - 310 + aggtggcaga 2.3
Hoxb1(3) 337 - 346 - cttgccttct 3.16
Hoxb1(2) 337 - 346 - cttgccttct 2.99
Hoxb1(2) 362 - 371 + agaaggatgg 8.17
Hoxb1(3) 362 - 371 + agaaggatgg 7.55
Hoxb1(1) 362 - 371 + agaaggatgg 9.25
Hoxb1(3) 402 - 411 - gtttccaaat 2.05
Hoxb1(2) 415 - 424 + agaggtagag 2.43
Hoxb1(3) 415 - 424 + agaggtagag 2.6
Hoxb1(2) 423 - 432 + agatctgtgc 2.71
Hoxb1(3) 423 - 432 + agatctgtgc 2.88
Hoxb1(2) 459 - 468 + aaatgctttg 3.45
Hoxb1(3) 459 - 468 + aaatgctttg 3.62
Hoxb1(3) 479 - 488 + agataaagat 2.14
Hoxb1(3) 481 - 490 + ataaagataa 2.51
Hoxb1(2) 481 - 490 + ataaagataa 2.08
Hoxb1(2) 485 - 494 + agataaatgc 5.23
Hoxb1(3) 485 - 494 + agataaatgc 5.14
Hoxb1(1) 485 - 494 + agataaatgc 4.61
Hoxb1(3) 489 - 498 + aaatgcaagg 4.7
Hoxb1(1) 489 - 498 + aaatgcaagg 4.17
Hoxb1(2) 489 - 498 + aaatgcaagg 4.79
Hoxb1(2) 571 - 580 - gcaccaaact 2.74
Hoxb1(3) 571 - 580 - gcaccaaact 2.91
Hoxb1(2) 581 - 590 - cgagtcacct 2.43
Hoxb1(3) 581 - 590 - cgagtcacct 2.6
Hoxb1(2) 587 - 596 + acctggagga 2.21
Hoxb1(3) 587 - 596 + acctggagga 2.38
Hoxb1(2) 593 - 602 + aggaggaggg 3.53
Hoxb1(1) 593 - 602 + aggaggaggg 2.91
Hoxb1(3) 593 - 602 + aggaggaggg 3.43
Hoxb1(2) 635 - 644 + agaggcgttg 2.19
Hoxb1(3) 635 - 644 + agaggcgttg 2.36
Hoxb1(2) 709 - 718 + gggaggatga 2.05
Hoxb1(3) 709 - 718 + gggaggatga 2.22
Hoxb1(2) 713 - 722 + ggatgactga 2.57
Hoxb1(3) 713 - 722 + ggatgactga 2.74
Hoxb1(3) 757 - 766 + aggcggatcc 2.08
Hoxb1(3) 774 - 783 - gctgccagct 2.22
Hoxb1(2) 774 - 783 - gctgccagct 2.05
Hoxb1(3) 778 - 787 - ccagctattg 2.96
Hoxb1(2) 778 - 787 - ccagctattg 2.79
Hoxb1(3) 801 - 810 - ccagccagag 2.1
Hoxb1(3) 847 - 856 + gcacggatgt 2.44
Hoxb1(2) 847 - 856 + gcacggatgt 2.27
Hoxb1(3) 858 - 867 - agacccaact 2.91
Hoxb1(2) 858 - 867 - agacccaact 2.74
Hoxb1(3) 867 - 876 - tcacagatct 3.33
Hoxb1(2) 867 - 876 - tcacagatct 3.16
Hoxb1(3) 871 - 880 + agatctgtgg 4.45
Hoxb1(2) 871 - 880 + agatctgtgg 4.55
Hoxb1(1) 871 - 880 + agatctgtgg 3.93
Hoxb1(3) 897 - 906 - ccgcccctca 2.02
Hoxb1(3) 910 - 919 - cccaccccct 2.1
Hoxb1(3) 932 - 941 - cctccagtct 2.74
Hoxb1(2) 932 - 941 - cctccagtct 2.57
Hoxb1(3) 968 - 977 + agagggatgg 7.17
Hoxb1(2) 968 - 977 + agagggatgg 7.8
Hoxb1(1) 968 - 977 + agagggatgg 8.88
Revision history for this message
![]() |
#3 |
avkapur,
I am assuming this is text in a console, and not data in openoffice. The solution is awk, and my awk is a little rusty.
If you send a small sample of the data to me, I may be able to work up a solution for you tonight.
You can scrub the data, just make sure the respective fields contain similar representative characters. I.e. numbers where there should be numbers, alphabetic characters where there should be alphabetic characters. Use the same punctuation and whitespace that is used in the original data.
Two gene identifiers with 5 rows of data apiece should be sufficient. Make sure the last row (per gene identifier) is similar to the last row of the actual data.
You can find my email address by clicking my name, and then the overview tab.
Thanks,
Harvey
Revision history for this message
![]() |
#5 |
The original message was received at Tue, 15 Jul 2008 12:44:22 -0400 (EDT)
from ms4.nss.udel.edu [128.175.1.24]
----- The following addresses had permanent delivery errors -----
<hmuller@mlab-1420>
----- Transcript of session is unavailable -----
Revision history for this message
![]() |
#6 |
>ENSMUSG00000000159
Hoxb1(3) 4 - 13 - tcatttctct 3.36
Hoxb1(2) 4 - 13 - tcatttctct 3.19
Hoxb1(3) 76 - 85 - acctctatcc 3.51
Hoxb1(2) 76 - 85 - acctctatcc 3.34
Hoxb1(2) 80 - 89 - ctatccaggc 2.73
>ENSMUSG00000023800
Hoxb1(3) 72 - 81 - ctattcatca 4.01
Hoxb1(2) 72 - 81 - ctattcatca 4.11
Hoxb1(1) 72 - 81 - ctattcatca 3.49
Hoxb1(2) 340 - 349 - acatctttca 2.19
Hoxb1(3) 340 - 349 - acatctttca 2.36
Hoxb1(3) 343 - 352 - tctttcatca 2.41
Revision history for this message
![]() |
#7 |
Ya, that was his SSH key id, I think. mlab must be his machine.
Anyhow, I think sorting with Awk is not so quick to implement. Maybe it’s cleaner, but working on the shell, I usually do it quick and dirty. So here is my proposal, maybe Harv can top this.
Unix can sort with "sort". But you have blocks to sort, so break them up and sort those.
rm -f sorted_file
cat unsorted_file | sed -e's/[[
for geneid in xx*
do
sort -k7 $geneid >> sorted_file
done
rm -v xx???
1. the "sed" part normalizes the white space, you sample was not well aligned in columns. I make them tab separated here.
2. the csplit part now writes a file for every gene id block (xx00, xx01, xx02,...). This may be seen as a disadvantage.
3. sort every small file (i.e., xx* ) and append sorted lines to the new file named sorted_file, or similar.
This is how I'd do it after thinking 10minutes about it.
Revision history for this message
![]() |
#8 |
Maybe a one-liner is better for you?
ruby -00ne'puts split("
I googled this here:
http://
And it's so neat and compact that I like it. You need to install the package ruby for this, of course.
Revision history for this message
![]() |
#10 |
Great solution Tobias! I'm a basic shell scriptor. Python, perl, and ruby, are scripting languages I have yet to tackle.
avkapur, that was my ssh key id, as Tobias mentioned. If you don't mind, please forward the text again to hlmuller -AT- yahoo -DOT- com.
I still want to refresh my awk skills, and the raw text as an attachment is the best to experiment with. I'll add the awk solution as a bonus.
Thanks!
Harvey